privacy-tech-lab / privacy-pioneer-web-crawler

Web crawler for detecting websites' data collection and sharing practices at scale using Privacy Pioneer
https://privacytechlab.org/
MIT License

Improve Crawler Error Handling #33

Closed dadak-dom closed 1 month ago

dadak-dom commented 2 months ago

As I've been running tests, I've noticed that there are some errors the crawler doesn't handle gracefully (for context, most errors are handled well, i.e. the crawler doesn't come to a hard stop). My theory is that, because Firefox Nightly updates so frequently, websites that previously worked can start crashing after a daily update. So before I continue on #9, I will improve how the crawler deals with these errors. If successful, this will also be a good improvement to the actual crawl.
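
For illustration, here is a minimal sketch of the kind of per-site error containment being described, assuming a hypothetical `visitSite` function; this is not the crawler's actual code, just the general pattern of keeping one site's failure from halting the whole run.

```typescript
// Sketch: contain failures to a single site visit so one crash doesn't stop the crawl.
// `visitSite` is a hypothetical per-site crawl function, not the crawler's real API.
async function crawlAll(
  sites: string[],
  visitSite: (url: string) => Promise<void>
): Promise<void> {
  for (const url of sites) {
    try {
      await visitSite(url);
    } catch (err) {
      // Log and move on rather than letting the error propagate and end the run.
      console.error(`Error while crawling ${url}:`, err);
    }
  }
}
```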

dadak-dom commented 2 months ago

Fairly certain I've got a solution ready. Once I've tested a little more, I'll put up a PR.

JoeChampeau commented 2 months ago

In a similar vein, while testing @dadak-dom's changes, I noticed that several sites were displaying bot check pages (i.e. "We detected strange behavior, please verify you are human") that weren't being detected by our HumanCheckError regexes. Our method, which we inherited from the GPC web crawler (privacy-tech-lab/gpc-web-crawler#51), relies on string-matching website titles to check for common phrases that signify a bot detection page, such as "Access Denied" or "Please verify you are human," but this, of course, can never be exhaustive.
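
For reference, a minimal sketch of the title-matching approach described above, assuming a selenium-webdriver setup; the pattern list and the error class shown here are illustrative, not the crawler's actual HumanCheckError implementation.

```typescript
import { WebDriver } from "selenium-webdriver";

// Illustrative patterns; the real list would be tuned against observed bot-check pages.
const BOT_CHECK_TITLE_PATTERNS: RegExp[] = [
  /access denied/i,
  /please verify you are (a )?human/i,
  /attention required/i,
];

class HumanCheckError extends Error {}

// Throw if the current page's title string-matches a known bot-check phrase.
async function assertNotBotCheck(driver: WebDriver): Promise<void> {
  const title = await driver.getTitle();
  if (BOT_CHECK_TITLE_PATTERNS.some((re) => re.test(title))) {
    // Surface the title so the offending site can be logged and excluded.
    throw new HumanCheckError(`Bot-check page detected: "${title}"`);
  }
}
```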

We're never going to catch all of them, but it is important to exclude as many as possible, because bot-check pages can't be justified as a commonplace part of average user browsing.

Any suggestions on possible improvements? Adding more regexes is always an option, but I would also like to investigate checking for elements whose class contains the word "captcha," for instances like the ones I discovered where the website title was never updated to something we could reasonably detect. Of course, the concern here is that we don't want to be overly aggressive and exclude legitimate pages as false positives.
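
A rough sketch of that class-based check, again assuming selenium-webdriver; the selector and helper name are hypothetical.

```typescript
import { By, WebDriver } from "selenium-webdriver";

// Hypothetical helper: look for any element whose class attribute contains "captcha",
// as a complement to the title-based regexes.
async function hasCaptchaElement(driver: WebDriver): Promise<boolean> {
  // CSS attribute substring selector with the case-insensitive flag; matches
  // class="g-recaptcha", class="captcha-box", etc.
  const matches = await driver.findElements(By.css('[class*="captcha" i]'));
  return matches.length > 0;
}
```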

dadak-dom commented 2 months ago

@JoeChampeau Probably the easiest approach that comes to mind is using XPath functions to match specific phrases. From what I can see, these sites tend to notify the "user" in similar ways when they're being blocked, e.g. "You have been blocked for clicking too fast" or something along those lines. Excluding the sites that match those phrases should remove most of the offenders, and it seems really unlikely that a genuine landing page would say something like "You have been blocked." Of course, we can't guarantee it, but it's better than nothing.
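
A rough sketch of what such an XPath phrase check could look like, assuming selenium-webdriver; the phrase list and helper are illustrative only.

```typescript
import { By, WebDriver } from "selenium-webdriver";

// Illustrative phrases; the real list would be tuned against observed block pages.
const BLOCK_PHRASES: string[] = [
  "You have been blocked",
  "detected unusual traffic",
  "verify you are human",
];

// Return true if the page body's text contains any known block phrase.
// Note: XPath contains() is case-sensitive, so the list would need common capitalizations.
async function pageContainsBlockPhrase(driver: WebDriver): Promise<boolean> {
  for (const phrase of BLOCK_PHRASES) {
    const hits = await driver.findElements(
      By.xpath(`//body[contains(., "${phrase}")]`)
    );
    if (hits.length > 0) return true;
  }
  return false;
}
```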

dadak-dom commented 1 month ago

Relevant PR has been merged 👍