internetarchive / brozzler

brozzler - distributed browser-based web crawler
Apache License 2.0

Add stealth parameter to avoid antibot systems #246

Closed vbanos closed 2 years ago

vbanos commented 2 years ago

The aim is to prevent antibot systems from detecting and blocking Brozzler. To do that, we need to run some JS before any other code runs on page load and mock the specific browser attributes that reveal Brozzler as a bot.

We add a stealth option to Browser, brozzler.cli and BrozzlerWorker. It is disabled by default.
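As a rough illustration of an opt-in switch that is off by default, here is a minimal argparse sketch. The flag name `--stealth`, the parser setup, and the help text are assumptions for illustration and are not copied from brozzler.cli.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with a hypothetical --stealth switch."""
    parser = argparse.ArgumentParser(prog="example-crawler")
    # store_true makes the option default to False, so stealth mode
    # stays disabled unless the flag is passed explicitly.
    parser.add_argument(
        "--stealth",
        action="store_true",
        help="run antibot evasion JS before any page script (assumed flag name)",
    )
    return parser


args_default = build_parser().parse_args([])
args_enabled = build_parser().parse_args(["--stealth"])
print(args_default.stealth, args_enabled.stealth)
```

The parsed boolean would then be passed down to whatever object launches the browser.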

If enabled, we inject stealth.js via Page.addScriptToEvaluateOnNewDocument, so it executes before any other script on the page.
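For reference, a minimal sketch of what the DevTools Protocol message could look like. `Page.addScriptToEvaluateOnNewDocument` takes the script text in its `source` parameter. The helper name `build_add_script_message` and the message id are hypothetical, and how Brozzler actually sends the message over its websocket is not shown here.

```python
import json


def build_add_script_message(message_id: int, script_source: str) -> str:
    """Serialize a CDP request that registers a script to run in every
    new document before the page's own scripts (hypothetical helper)."""
    return json.dumps(
        {
            "id": message_id,
            "method": "Page.addScriptToEvaluateOnNewDocument",
            "params": {"source": script_source},
        }
    )


# Placeholder source; in practice this would be the contents of stealth.js.
message = build_add_script_message(1, "/* contents of stealth.js */")
print(message)
```

The message has to be sent before navigating, since the script only applies to documents created afterwards.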

For now, we mock only the graphics driver attributes. If this is OK, we can add more antibot evasions in the same script.
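A sketch of one way the graphics driver mock could be written, kept as a Python string so it can be passed to the browser. It wraps `getParameter` and answers the two `WEBGL_debug_renderer_info` constants (37445 for the unmasked vendor, 37446 for the unmasked renderer). This is not the actual stealth.js; the function name and the vendor/renderer strings below are placeholders.

```python
import json

# JS template; __VENDOR__ and __RENDERER__ are replaced with JSON-encoded strings.
_WEBGL_MOCK_TEMPLATE = """
(() => {
  // 37445 = UNMASKED_VENDOR_WEBGL, 37446 = UNMASKED_RENDERER_WEBGL
  const overrides = new Map([[37445, __VENDOR__], [37446, __RENDERER__]]);
  for (const ctx of [window.WebGLRenderingContext, window.WebGL2RenderingContext]) {
    if (!ctx) continue;
    const original = ctx.prototype.getParameter;
    ctx.prototype.getParameter = function (parameter) {
      if (overrides.has(parameter)) return overrides.get(parameter);
      return original.call(this, parameter);
    };
  }
})();
"""


def make_webgl_mock_js(vendor: str, renderer: str) -> str:
    """Return JS that reports the given WebGL vendor and renderer.

    json.dumps produces quoted, escaped string literals for the template.
    """
    return (
        _WEBGL_MOCK_TEMPLATE
        .replace("__VENDOR__", json.dumps(vendor))
        .replace("__RENDERER__", json.dumps(renderer))
    )


# Placeholder values, not the ones used in this PR.
script = make_webgl_mock_js("Example Vendor", "Example Renderer")
print(script)
```

A detector that inspects `getParameter.toString()` could still notice the wrapper, so this covers only the simple vendor/renderer check.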

There are many antibot tests; we are using this one: https://bot.sannysoft.com/

Inspired mainly by: https://www.npmjs.com/package/puppeteer-extra-plugin-stealth

vbanos commented 2 years ago

Check out the "WebGL Vendor" and "WebGL Renderer" fields with and without stealth mode. The mock values are from my desktop Ubuntu 22.04 using an NVIDIA Quadro graphics card. Brozzler is running on a VM using XVNC.

[Image: screenshot-stealth]

[Image: screenshot]