ulixee / hero

The web browser built for scraping
MIT License

cpu utilization, are there knobs? #253

Open andynuss opened 3 months ago

andynuss commented 3 months ago

I am asking a question here about a broad topic, because of something quite noticeable in my scraping sessions compared with the old secret-agent codebase, and because of one big change that I have made:

  1. I have been using m3.mediums on AWS, each of which really behaves like a machine with about 70% of one CPU
  2. I have always tried to scrape two URLs concurrently on this EC2 instance, with a bit less than one CPU
  3. unlike before, the CPU utilization is absolutely "pegged", probably at about 98%
  4. proof 1: when I ssh into the EC2 instance, it takes forever to cat a medium-sized log file
  5. proof 2: there are lots of missed pings and health checks, so my instances are considered unhealthy and terminated too soon

I am going to quickly make changes to introduce some self-imposed yielding. I am also going to be less aggressive with my one big change, which is how I decide when a page has "loaded": instead of basing it on resource activity dialing down, as I used to, I now base it on a heuristic of percent change to the DOM. But this involves repeated injections to get the DOM (I didn't think that would be expensive) and a moderately CPU-expensive approach to comparing the two DOMs.
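For anyone following along, here is a minimal sketch of what the yielding plus percent-change idea could look like. It is not my actual code and not a Hero API: `percentChange`, `waitForDomStable`, `getSnapshot` and all the option names are hypothetical, and the length-only comparison is a deliberately cheap stand-in for a real DOM diff.

```typescript
// Hypothetical helper: rough percent difference between two serialized DOM
// snapshots, based on length only. Far cheaper than a structural diff, but
// it misses changes that leave the length unchanged.
function percentChange(previous: string, current: string): number {
  const larger = Math.max(previous.length, current.length);
  if (larger === 0) return 0;
  return (Math.abs(current.length - previous.length) / larger) * 100;
}

// Sleeping between polls is the "self-imposed yielding": the event loop and
// the CPU are free for other work while we wait.
const sleep = (ms: number): Promise<void> =>
  new Promise(resolve => setTimeout(resolve, ms));

interface StableOptions {
  thresholdPercent: number; // changes at or below this count as "stable"
  pollIntervalMs: number;   // pause between snapshots
  stablePolls: number;      // consecutive stable polls required
  timeoutMs: number;        // give up after this long
}

// getSnapshot is an assumption: any function that returns the page's
// serialized DOM as a string, however it is obtained.
async function waitForDomStable(
  getSnapshot: () => Promise<string>,
  options: StableOptions = {
    thresholdPercent: 2,
    pollIntervalMs: 500,
    stablePolls: 3,
    timeoutMs: 30_000,
  },
): Promise<boolean> {
  const deadline = Date.now() + options.timeoutMs;
  let previous = await getSnapshot();
  let stableCount = 0;

  while (Date.now() < deadline) {
    await sleep(options.pollIntervalMs);
    const current = await getSnapshot();
    const change = percentChange(previous, current);
    // Reset the counter whenever the DOM moves more than the threshold.
    stableCount = change <= options.thresholdPercent ? stableCount + 1 : 0;
    if (stableCount >= options.stablePolls) return true;
    previous = current;
  }
  return false; // timed out before the DOM settled
}
```

Raising `pollIntervalMs` trades detection latency for fewer injections, which is probably the first thing to tune on a machine with less than one CPU.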

This combination should ease the load on the CPU and bring it back into an acceptable range, but I was wondering whether Hero does more CPU work than Secret Agent did, even for features we don't use. Specifically, the "TimeTravel" feature?

Are there knobs that exist (or could be added) that allow us to disable features like this that might consume significant CPU? Aside from the immediate issue I described above, a "bare-metal" use of devtools has to be better for mitigating bot detection, right?

blakebyrnes commented 3 months ago

Most of the time travel was in secret agent too, but there are some small parts that might be making things worse, and/or some pages that could be causing issues when there are a ton of dom updates. We don't have controls yet to turn off time travel or to turn it down, but it seems to be a high priority to do something. Thanks for reporting this.