ultrafunkamsterdam / nodriver

Successor of Undetected-Chromedriver. Providing a blazing fast framework for web automation, webscraping, bots and any other creative ideas which are normally hindered by annoying anti bot systems like Captcha / CloudFlare / Imperva / hCaptcha
GNU Affero General Public License v3.0
1.42k stars 155 forks

Problem running nodriver in headless mode #5

Open KenyOnFire opened 2 months ago

KenyOnFire commented 2 months ago

While testing the nodriver module in headless mode, I found that enabling this mode changes the user-agent, which makes the browser detectable as a bot. Below is the user-agent returned when running headless. Thank you!

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/128.0.0.0 Safari/537.36
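To illustrate why this string gives the browser away, here is a minimal sketch of the kind of check a server could run against it. The function name `looks_headless` is hypothetical, and real anti-bot systems rely on many more signals than this one token.

```python
def looks_headless(user_agent: str) -> bool:
    """Return True if the user-agent carries the HeadlessChrome product token."""
    return "headlesschrome/" in user_agent.lower()


# The user-agent reported in this issue.
headless_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/128.0.0.0 Safari/537.36"
)

print(looks_headless(headless_ua))  # True
```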

TEMPORARY FIX: Inside the nodriver module there is a class called Config. On line 185, after if self.headless: args.append("--headless=new"), I added a call with the requests module that fetches a recent Chrome user-agent without the 'Headless' marker. This way the 'Headless' text is gone before the browser starts. I leave the code here in case it helps someone:

import platform
import requests

so_key = {"windows": "windows", "linux": "linux", "darwin": "mac"}[platform.system().lower()]
uas = requests.get("https://jnrbsn.github.io/user-agents/user-agents.json").json()
ua = next(ua for ua in uas if so_key in ua.lower() and "chrome" in ua.lower() and "firefox" not in ua.lower())
args.append('--user-agent=' + ua)
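The selection step of the workaround above can be pulled into a small function that is testable without a network call. This is only a sketch: `pick_chrome_user_agent` and `SAMPLE_USER_AGENTS` are hypothetical names, the sample strings stand in for the downloaded list, and the extra "edg/" and "headless" exclusions are additions that are not part of the original snippet.

```python
import platform
from typing import Iterable, Optional

# Same OS mapping as in the workaround above.
OS_KEYS = {"windows": "windows", "linux": "linux", "darwin": "mac"}


def pick_chrome_user_agent(
    user_agents: Iterable[str], system: Optional[str] = None
) -> Optional[str]:
    """Return the first plain Chrome user-agent matching the given OS, or None."""
    os_key = OS_KEYS.get((system or platform.system()).lower())
    if os_key is None:
        return None
    for ua in user_agents:
        low = ua.lower()
        is_chrome = "chrome/" in low
        # Skip Firefox, Edge (which also contains "Chrome/") and headless strings.
        excluded = any(token in low for token in ("firefox", "edg/", "headless"))
        if os_key in low and is_chrome and not excluded:
            return ua
    return None


# Hypothetical sample data standing in for the downloaded JSON list.
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:129.0) Gecko/20100101 Firefox/129.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
]

print(pick_chrome_user_agent(SAMPLE_USER_AGENTS, "Windows"))
```

Returning None instead of raising lets the caller fall back to the browser's default user-agent when no match is found.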

ioio101 commented 2 months ago

The irony in a library designed to ensure Chrome's stealth as a web scraper, yet inadvertently revealing itself by failing to suppress the very "HeadlessChrome" signature it was supposed to conceal in headless mode.

devblack commented 2 months ago

requests.get("https://jnrbsn.github.io/user-agents/user-agents.json").json()

Hello. That is unnecessary. You can replace it manually with useragent_override and the replace() method.
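One way to read the replace() part of this suggestion is sketched below: take the user-agent the headless browser already reports and rewrite the product token locally, with no HTTP request. The helper names `strip_headless` and `user_agent_flag` are hypothetical, and `useragent_override` itself is not shown because the thread does not say where it lives in nodriver. `--user-agent` is the same Chrome switch used in the workaround above.

```python
def strip_headless(user_agent: str) -> str:
    """Rewrite the HeadlessChrome product token to plain Chrome."""
    return user_agent.replace("HeadlessChrome", "Chrome")


def user_agent_flag(user_agent: str) -> str:
    """Build a --user-agent command-line switch from a cleaned user-agent."""
    return "--user-agent=" + strip_headless(user_agent)


# The user-agent reported in this issue.
reported = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/128.0.0.0 Safari/537.36"
)

print(strip_headless(reported))
# Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36
```

This keeps the Chrome version in the string identical to the version of the browser that is actually running, which a downloaded list cannot guarantee.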

KenyOnFire commented 2 months ago

requests.get("https://jnrbsn.github.io/user-agents/user-agents.json").json()

Hello. That is unnecessary. You can replace it manually with useragent_override and the replace() method.

I know my approach is neither necessary nor practical in the long run, but I couldn't apply your suggestion. Could you be more specific about how to use useragent_override? I can't find any documentation for it. Besides, the idea is for the browser to already carry a user-agent without the word Headless before it is initialized, as undetected-chromedriver does. If you could share example code that performs this fix, that would be great and I could close the thread.

PS: I have also tried this code, but it only applies the CDP override to the current tab, not to the entire browser:

async def change_useragent(self, useragent):
    self.page.feed_cdp(cdp.emulation.set_user_agent_override(useragent))
    return await self.page.reload()

boludoz commented 2 months ago

The irony in a library designed to ensure Chrome's stealth as a web scraper, yet inadvertently revealing itself by failing to suppress the very "HeadlessChrome" signature it was supposed to conceal in headless mode.

Just run some JavaScript that does it, or start Chrome with a custom user agent from the command line, and stop crying.

Toxenskiy commented 2 months ago

Study the documentation on user agents