ulixee / secret-agent

The web scraper that's nearly impossible to block - now called @ulixee/hero
https://secretagent.dev
MIT License

fingerprint avoidance: getting started with browser emulators and changing viewports, and user-agents #362

Open andynuss opened 3 years ago

andynuss commented 3 years ago

My understanding of secret-agent is that one should let secret-agent do most of the heavy lifting for browser emulation with the goal of avoiding fingerprinting.

For a newcomer to avoiding bot-detection, it seems like the most important things are the user-agent that "seems" to be in use, the viewport that seems to be in use, the fonts that seem to be installed, and various device properties of the OS that "seems" to be in use, versus the real OS our scraper is running on (in our case, ideally an EC2 instance).

If we simply create a new Agent without any browser emulator plugin guidance, and specifying some reasonably random viewport of our choice, are we getting secret-agent's anti-fingerprinting by default? I.e. what user-agent will we get for each Agent constructed, and what viewport?

Given that it can take significant time to create a new agent that is "bot-proof", and that the more aggressively you use secret-agent the more unique your constructed agent's fingerprint becomes, is it advisable to reuse the agent for several consecutive scrapes (i.e. for 10 or so page urls: waiting for a new tab, using that tab to scrape the url, and then closing the tab)?

Is there a proper way to guide secret-agent with the use of the DefaultBrowserEmulator to select our own reasonable user-agent and viewport? Also, in terms of viewport specifically, I noticed that secret-agent distinguishes between width/height and screenWidth/screenHeight, but playwright only has a single width and height. Can you explain that?

Is there anything we can do to help avoid being fingerprinted based on the lack of fonts installed on our linux ec2 instance?

Is it important not to use a really slow (cheap) AWS EC2 machine whose CPU is only a fraction of a full CPU (such as m3.medium, or even cheaper instances, or even Lambda)? I.e. that would make the page load event (followed by the "scrape") take much longer than "normal", inflating 10 seconds to as many as 60 for a slow-loading page.

blakebyrnes commented 3 years ago

If we simply create a new Agent without any browser emulator plugin guidance, and specifying some reasonably random viewport of our choice, are we getting secret-agent's anti-fingerprinting by default? I.e. what user-agent will we get for each Agent constructed, and what viewport?

Secret Agent will automatically rotate your user-agent within the installed browsers, and will move your browser window around the screen a little - it uses the most popular viewport from https://gs.statcounter.com/. We chose not to change the viewport size by default because it can cause scripts to behave differently on different runs. You can do so, but just make sure to test it out.
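
For reference, here's a minimal sketch of inspecting what a default Agent was assigned (assuming the secret-agent 1.x API, where the Agent class is exported from the package and `agent.meta` resolves to the session's emulation details):

```ts
import { Agent } from 'secret-agent';

(async () => {
  // A default Agent: SecretAgent picks the user-agent and viewport itself.
  const agent = new Agent();

  // agent.meta resolves to the emulation details chosen for this session
  // (user-agent string, viewport, timezone, locale, etc.).
  console.log(await agent.meta);

  await agent.close();
})();
```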

Many checks are bot-blocker and website specific. Some sites check that your user agent matches a browser and OS exactly, some just put you in a bucket based on statistics they grab from things like fingerprintjs.

Your IP is the most frequently checked piece of information beyond the browser itself. You can rotate it using VPNs or proxies, and some sites check that your IP generally matches the timezone and language settings you use.
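
As a rough sketch of keeping those consistent (using the `upstreamProxyUrl`, `timezoneId`, and `locale` create options; the proxy URL below is just a placeholder):

```ts
import { Agent } from 'secret-agent';

// Sketch: route traffic through a proxy and keep the timezone/locale
// consistent with where that proxy exits. The proxy URL is a placeholder.
const agent = new Agent({
  upstreamProxyUrl: 'http://user:password@proxy.example.com:8080',
  timezoneId: 'America/New_York',
  locale: 'en-US',
});
```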

Assuming that the more aggressive you are in using secret-agent to create a unique fingerprint for your constructed agent, is it advisable to reuse the agent for several consecutive scrapes (i.e. for 10 or so page urls, waiting for a new tab, and using that tab to scrape the url, and the closing the tab)? (on the assumption that it can take significant time to create a new agent that is "bot-proof").

This is very dependent on your site. Some will want to see cookies and local storage in use, so will be aggressive if they think you're "new" to their site. In this case, you'll want to use the userProfile feature to create some user profiles, and/or have some setup script activity that generates a profile (eg, visit several pages on the site first).
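
A minimal sketch of that warm-up idea, assuming the documented `exportUserProfile()` / `userProfile` pair (the warm-up URLs are placeholders):

```ts
import { Agent } from 'secret-agent';

// Build up cookies/localStorage by browsing a few pages first, then reuse
// that state in later agents so the site doesn't see a brand-new visitor.
async function buildProfile() {
  const agent = new Agent();
  await agent.goto('https://example.com/');
  await agent.goto('https://example.com/some-category');
  const profile = await agent.exportUserProfile();
  await agent.close();
  return profile;
}

async function scrapeWithProfile(url: string, profile: any) {
  // Re-apply the saved cookies/storage to a fresh agent.
  const agent = new Agent({ userProfile: profile });
  await agent.goto(url);
  // ... scrape ...
  await agent.close();
}
```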

Is there a proper way to guide secret-agent with the use of the DefaultBrowserEmulator to select our own reasonable user-agent and viewport? Also, in terms of viewport specifically, I noticed that secret-agent distinguishes between width/height and screenWidth/screenHeight, but playwright only has a single width and height. Can you explain that?

You can configure both every time you create an agent, but you likely only want to change the viewport if you're seeing issues with it, and you likely don't want to change the user agent unless you're seeing issues either. SecretAgent will automatically apply some entropy to both.

Playwright is likely just not concerned in the slightest with the size of your screen or position of your browser... those don't matter for testing sites for the most part. They're part of the Devtools api and part of what can be checked by the browser though, so we vary them.
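
As a rough sketch of setting both explicitly (the viewport fields follow the shape discussed above, where width/height are the page viewport and screenWidth/screenHeight describe the emulated monitor; the user-agent string is just an example):

```ts
import { Agent } from 'secret-agent';

const agent = new Agent({
  // An explicit user-agent string (illustrative; normally let SecretAgent rotate it).
  userAgent:
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
  viewport: {
    // The page viewport (what the document sees).
    width: 1280,
    height: 720,
    // The emulated monitor behind it (what window.screen reports).
    screenWidth: 1920,
    screenHeight: 1080,
  },
});
```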

Is there anything we can do to help avoid being fingerprinted based on the lack of fonts installed on our linux ec2 instance?

There are many font packages out there you can install. Just varying them should be enough in general - I don't know of bot blockers leaning too heavily on OS-specific fonts.

Is it important not to use a really slow (cheap) aws ec2 machine whose cpu is only a fraction of a full cpu (such as m3.medium, or even cheaper instances, or even lambda)? I.e. this would lead to the page load event (followed by "scrape") taking quite a long time compared to what is "normal", such as inflating 10 seconds to as many as 60 for a slow-loading page.

CPU detection is very rare in scraping detection (from what I know), but there are a few bot detectors running "red pill" checks for virtual machines - there's some research into this that the very aggressive bot blockers have tried a bit, but it's not 100% reliable, so I think it's still somewhat rare.

andynuss commented 3 years ago

Thanks for the answers above. I will study them some more, but before I ask any followup questions:

... tucked into my questions above was the idea of using a single agent to scrape several urls before closing the agent. This is similar to creating a Playwright Page from a Browser, and then closing the Page before going on to the next url to scrape with the same Browser instance.

This is especially important for me because I have set a resource listener on my agent's Tab.

So I tried implementing the following technique for 10+ consecutive urls (a rough sketch in code follows the list):

  1. optionally call agent.configure() before the next scrape (e.g. to maybe change the viewport)
  2. call agent.waitForNewTab() to get a closeable tab for the scrape session
  3. attach a tab.on('resource', ...) listener to view the resources that are loaded when I visit the page
  4. call tab.goto(url)
  5. inject javascript to get the various things in the dom and its frames that I am interested in
  6. close the Tab created in step 2 above so that there is a clean demarcation between this goto's resources and the next one's
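
Here is roughly that pattern in code (a simplified sketch of the steps above; the viewport values are placeholders and the real scraping logic is elided):

```ts
import { Agent } from 'secret-agent';

// A simplified sketch of the attempted pattern (it times out at step 4).
async function scrapeUrls(urls: string[]) {
  const agent = new Agent();
  for (const url of urls) {
    // step 1 (optional): tweak the viewport before the next scrape
    await agent.configure({
      viewport: { width: 1280, height: 720, screenWidth: 1920, screenHeight: 1080 },
    });
    // step 2: get a closeable tab for this scrape session
    const tab = await agent.waitForNewTab();
    // step 3: watch which resources load for this page
    tab.on('resource', resource => console.log('resource:', resource.url));
    // step 4: navigate - this is the call that times out
    await tab.goto(url);
    // step 5: inject javascript / extract from the dom here
    // step 6: close the tab to demarcate this page's resources from the next
    await tab.close();
  }
  await agent.close();
}
```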

The problem is that I am getting a timeout on the tab.goto(url) call (step 4), which makes me think my design pattern isn't even possible, because waitForNewTab() does not in fact act like Playwright's browser.newPage() function.

Is there a way in secret-agent to accomplish the encapsulation that I am trying to do, allowing some kind of async close() call prior to each new url scraped for a given agent?

blakebyrnes commented 3 years ago

waitForNewTab is currently more for popups, or for Ctrl-clicking a link to force it into a new tab. I think your approach will be simpler if you just create a new agent for each of your scrapes and/or skip the tab-close portion. You can always use a userProfile to restore any state you want between the agents.
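
A minimal sketch of the one-agent-per-url suggestion (assuming agent.activeTab is available right away and that 'AllContentLoaded' is a valid waitForLoad state; the extraction logic is elided):

```ts
import { Agent } from 'secret-agent';

// One agent per url: each scrape gets its own agent, so resource events
// can't bleed between pages, and fingerprint rotation is left to the defaults.
async function scrapeOne(url: string) {
  const agent = new Agent();
  try {
    agent.activeTab.on('resource', resource => console.log('loaded:', resource.url));
    await agent.goto(url);
    await agent.activeTab.waitForLoad('AllContentLoaded');
    // ... extract whatever is needed from the page here ...
  } finally {
    await agent.close();
  }
}
```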

andynuss commented 3 years ago

Sounds good.

andynuss commented 3 years ago

I was wondering how long it took on average to create an agent (the default way) on my ec2 instance, and after the vm warms up, it appears to fluctuate between 2.0 and 4.5 seconds. This does seem significant, and it seems like it would be good to avoid creating a new agent for each page.

I assume that if it weren't for the fact that I need to listen for which web fonts are being requested for a given goto/page, I could simply navigate to another page with the same agent. So what I was thinking is this: after I decide I am done with the information I need for page N, navigate (with agent.goto) to my own empty localhost page (on that ec2 instance), reset the fontList for the next page, and then immediately navigate to page N+1. Will this allow me to reuse an agent for several agent.goto() calls in a row without getting confused about which resources belong to which page?
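
Roughly what I have in mind (a sketch; the blank localhost url is a placeholder, and the font filter assumes resource.url is the request url):

```ts
import { Agent } from 'secret-agent';

// Sketch: reuse one agent across pages, "flushing" through a blank local page
// between navigations so each page's resources can be kept separate.
async function scrapeSequence(urls: string[]) {
  const agent = new Agent();
  let fontUrls: string[] = [];
  agent.activeTab.on('resource', resource => {
    const url = String(resource.url);
    if (/\.(woff2?|ttf|otf)(\?|$)/.test(url)) fontUrls.push(url);
  });

  for (const url of urls) {
    await agent.goto(url);
    await agent.activeTab.waitForLoad('AllContentLoaded');
    console.log(url, 'web fonts:', fontUrls);
    // ... scrape the page here ...

    // Flush: navigate to an empty local page, then reset the font list so the
    // next page's resources aren't mixed in with this one's.
    await agent.goto('http://localhost:8080/blank.html');
    fontUrls = [];
  }
  await agent.close();
}
```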

blakebyrnes commented 3 years ago

Seems reasonable. We could probably also surface information about which "document" is requesting a given resource. Is that 2-4.5 seconds under load? Our unit test suite runs on small-ish linux machines, and I'm under the impression it launches new "browser contexts" far faster than that, given the number of tests and the time they take to complete. That said, I haven't done any real analysis, so maybe I'm wrong...

andynuss commented 3 years ago

Concerning my test on the reported slowness of creating an agent:

  1. the slowness report for the agent constructor was wrong - that is always very fast. The problem was that my async newAgent function also included a configure() call AND an optional 'await agent.meta'. My test had not turned off the request for the IAgentMeta, and that is what was taking the unexpected time.
  2. ... in fact, I discovered that on my slowish (but really cheap) ec2 instance (m3.medium, one virtual cpu which can actually be shared with other customers), my very first call to 'await agent.meta' was taking anywhere from 20 to 40 seconds! After that, I assume it gets a lot faster because of V8 warmup of the functions involved.
  3. but also, I discovered that agent.close(), which I had forgotten to measure in my previous tests, was indeed prohibitively slow.

Noting that there are generally two simultaneous scrape sessions sharing the same single-cpu m3.medium, here is a sampling of wall-clock times:

| total (ms) | new agent (ms) | close agent (ms) | goto (ms) | eval (ms) | url |
|---|---|---|---|---|---|
| 16778 | 0 | 4292 | 6501 | 2246 | https://twitter.com/nimishdubey/status/1278289269648285696 |
| 34332 | 1 | 1483 | 9819 | 4476 | https://www.bloomberg.com/news/articles/2020-06-30/apple-cancels-arcade-games-in-strategy-shift-to-keep-subscribers |
| 6458 | 0 | 1916 | 3462 | 957 | https://www.bbc.co.uk/news/magazine-36321692 |
| 13406 | 0 | 2068 | 7674 | 3432 | https://techpoint.africa/2020/01/13/check-your-nin-with-a-ussd-code/ |

NOTE: the goto time covers the actual call to agent.goto, plus the call to agent.activeTab.waitForLoad, plus a call to my custom plugin function that waits an additional time for all the iframes (up to two levels deep) to load. The eval time is the time to do my scraping of html and stylesheets from the main frame and the other iframes.

andynuss commented 3 years ago

In other words, agent.close() is slow enough that it will make sense to reuse the same agent for successive request urls, as I talked about above. But I noticed you mentioned that for some sites it is important to do something even stronger than reusing an agent for, say, 20 consecutive scrapes: that when I scrape medium.com or guardian.com, I somehow need to keep cookies, local storage, and IndexedDB storage accumulating from the last scrape I did for that domain.

The problem is that a given ec2 instance may only scrape 300 total urls before being rotated, and in this issue I was talking about rotating an "agent" after just 20 scrapes. I guess I can figure out some way to load and save profiles from my own central server, but how exactly do I apply a given saved profile to a newly opened agent? And how do I save a profile for one agent and then, for a different url, load a different profile? It looks like this is going to force me into the following model: create a new agent, apply the saved profile for this domain, scrape, snapshot the profile after the scrape and save it to my server, and then close the agent. That will probably make the lifecycle part of the scrape pretty expensive compared to the goto/load time plus the page eval time.
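
Roughly the model I mean, sketched in code (loadProfileFromServer/saveProfileToServer are hypothetical stubs for my central server; the profile handling assumes exportUserProfile() and the userProfile create option):

```ts
import { Agent } from 'secret-agent';

// Hypothetical helpers for my own central profile store (stubbed out here).
async function loadProfileFromServer(domain: string): Promise<any | null> {
  return null; // TODO: fetch the last saved profile for this domain
}
async function saveProfileToServer(domain: string, profile: any): Promise<void> {
  // TODO: persist the profile for this domain
}

async function scrapeWithDomainProfile(url: string) {
  const domain = new URL(url).hostname;
  // Apply the last saved profile for this domain (cookies, storage), if any.
  const saved = await loadProfileFromServer(domain);
  const agent = new Agent(saved ? { userProfile: saved } : {});
  try {
    await agent.goto(url);
    // ... scrape ...
    // Snapshot the accumulated state and push it back to the server.
    await saveProfileToServer(domain, await agent.exportUserProfile());
  } finally {
    await agent.close();
  }
}
```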