privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Identify sites that go to "Access Denied" or "Verify you are a human" pages when loaded #51

Closed: katehausladen closed this issue 1 year ago

katehausladen commented 1 year ago

I've noticed that some sites go to a page that says some iteration of "Access Denied" or "Verify you are a human." I think this is mostly caused by the VPN (i.e., the VPN IP address is blocked/recognized), but the crawler probably contributes, too. There also seems to be some variability in which sites are affected from run to run, particularly in the "Verify you are a human" category. We want to:

  1. identify these sites, since the crawl data will be inaccurate
  2. hopefully find a way to bypass at least some of these issues

Currently, I am identifying them by the title of the site (i.e. "Access Denied" and "Just a moment..." in the pictures below).

[Screenshots (2023-06-21): an "Access Denied" page and a "Just a moment..." human-verification page]

The sites I do identify are logged as either a "VerifyHumanError" or an "AccessDeniedError". The crawler won't restart after these errors, so it doesn't affect the overall flow.

In the latest full crawl, the sites that asked to verify you were a human were:

VerifyHumanError: [ 'https://www.legacy.com', 'https://www.ticketsatwork.com', 'https://www.fixya.com', 'https://www.cameo.com', 'https://www.cardinalcommerce.com', 'https://www.cinemark.com', 'https://www.securecafe.com', 'https://www.lordandtaylor.com', 'https://www.moneytalksnews.com', 'https://www.allegiantair.com', 'https://www.newspapers.com', 'https://www.rentcafe.com', 'https://www.babylonbee.com', 'https://www.fleetfarm.com', 'https://www.jegs.com', 'https://www.appurse.com', 'https://www.123-movies.com', 'https://www.camelcamelcamel.com', 'https://www.muscleandstrength.com' ]

And the sites with Access Denied were:

AccessDeniedError: [ 'https://www.sprint.com', 'https://www.kroger.com', 'https://www.petsmart.com', 'https://www.tacobell.com', 'https://www.subway.com', 'https://www.zoosk.com', 'https://www.officedepot.com', 'https://www.hotwire.com', 'https://www.meijer.com', 'https://www.jcrew.com', 'https://www.backcountry.com', 'https://www.littlecaesars.com', 'https://www.fisglobal.com', 'https://www.fossil.com', 'https://www.flyertalk.com', 'https://www.citizensbank.com', 'https://www.earthlink.net', 'https://www.apartmentfinder.com', 'https://www.demandforce.com', 'https://www.pizzahut.com', 'https://www.t-mobile.com' ]
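For reference, a minimal sketch of what the title-based classification described above could look like with Node.js and selenium-webdriver. The function name and return values are illustrative, not the crawler's actual code:

```js
// Minimal sketch (not the crawler's actual code) of title-based detection of
// blocked pages. `driver` is a selenium-webdriver WebDriver instance that has
// already loaded the site.
async function classifyBlockedPage(driver) {
  const title = await driver.getTitle();
  if (title === 'Just a moment...') {
    // Cloudflare's "Verify you are a human" interstitial uses this title
    return 'VerifyHumanError';
  }
  if (title.includes('Access Denied')) {
    return 'AccessDeniedError';
  }
  return null; // the title gives no indication that the page is blocked
}
```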

OliverWang13 commented 1 year ago

I've put some thought into this too and we could search for text like "Access Denied" or "Verify you are a human" by using a method similar to how we search for Do Not Sell links. Also, while we have access to @sophieeng's IP address, we could have her run a crawl and see if sites are still blocked.
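A rough sketch of that text-search idea (the phrases and the lookup are assumptions, not the existing Do Not Sell link code):

```js
// Sketch only: scan the rendered page body for block-page phrases, similar in
// spirit to how the crawler searches page text for Do Not Sell links.
const { By } = require('selenium-webdriver');

// Hypothetical phrase list; new variants could be appended as they are found.
const BLOCK_PHRASES = [/access denied/i, /verify you are a human/i];

async function pageLooksBlocked(driver) {
  const body = await driver.findElement(By.tagName('body'));
  const text = await body.getText();
  return BLOCK_PHRASES.some((re) => re.test(text));
}
```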

katehausladen commented 1 year ago

some more examples:

ask petco zoominfo
katehausladen commented 1 year ago

I wrote some code to click the Cloudflare "Verify you are a human" button (as in the cinemark.com example). When the button is clicked, the same captcha page loads again, so clicking the button alone does not work; Cloudflare is evidently detecting that we are using Selenium. I haven't found any packages that would help with this for Selenium with Node.js and Firefox, and since it's only impacting ~20 sites, I'm not sure it's worth spending a ton of time trying to bypass. We can discuss this more at the meeting.
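For context, the attempt looked roughly like the sketch below. The iframe and checkbox selectors are guesses at Cloudflare's challenge markup and may not match the real page; as noted above, even a successful click just reloads the same challenge:

```js
// Rough sketch of the "click the checkbox" attempt. Selectors are assumptions
// about Cloudflare's challenge widget, which is rendered inside an iframe.
const { By, until } = require('selenium-webdriver');

async function tryClickCloudflareCheckbox(driver) {
  // Wait for the challenge iframe, then switch into it.
  const frame = await driver.wait(
    until.elementLocated(By.css('iframe[title*="challenge"]')), 10000);
  await driver.switchTo().frame(frame);
  // Click the verification checkbox inside the iframe.
  await driver.findElement(By.css('input[type="checkbox"]')).click();
  await driver.switchTo().defaultContent();
}
```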

SebastianZimmeck commented 1 year ago

> The crawler won't restart after these errors, so it doesn't affect the overall flow.

The crawler continues crawling (as a clarification).

> impacting ~20 sites

Out of the roughly 1,800 sites in the crawl, though, those ~20 are specifically the Cloudflare human checks.

katehausladen commented 1 year ago

We decided in the last meeting that we will not try to bypass any human checks. We will add new titles to the regex in the visit_site function in local-crawler.js as we see more sites like this. The README has also been updated to document this in Section 4.
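For illustration, a title regex along these lines; this is hypothetical, and the real pattern in visit_site in local-crawler.js may differ:

```js
// Hypothetical sketch of an extensible blocked-page title check. New title
// alternatives can be appended to the pattern as more variants are found.
const blockedTitleRegex = /Access Denied|Just a moment\.\.\./i;

async function titleIndicatesBlockedPage(driver) {
  const title = await driver.getTitle();
  return blockedTitleRegex.test(title);
}
```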