privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Identify sites that go to "Access Denied" or "Verify you are a human" pages when loaded #51

Closed: katehausladen closed this issue 1 year ago

katehausladen commented 1 year ago

I've noticed that some sites go to a page that says some iteration of "Access Denied" or "Verify you are a human." I think this is mostly caused by the VPN (i.e., the VPN IP address is blocked/recognized), but the crawler probably contributes, too. There also seems to be some variability in which sites are affected from run to run, particularly in the "Verify you are a human" category. We want to:

  1. identify these sites, since the crawl data will be inaccurate
  2. hopefully find a way to bypass at least some of these issues

Currently, I am identifying them by the title of the site (i.e. "Access Denied" and "Just a moment..." in the pictures below).

[Screenshots (2023-06-21): an "Access Denied" page and a "Just a moment..." human-verification page]

The sites I do identify are logged as either a "VerifyHumanError" or an "AccessDeniedError". The crawler won't restart after these errors, so it doesn't affect the overall flow.

In the latest full crawl, the sites that asked to verify you were a human were:

VerifyHumanError: [ 'https://www.legacy.com', 'https://www.ticketsatwork.com', 'https://www.fixya.com', 'https://www.cameo.com', 'https://www.cardinalcommerce.com', 'https://www.cinemark.com', 'https://www.securecafe.com', 'https://www.lordandtaylor.com', 'https://www.moneytalksnews.com', 'https://www.allegiantair.com', 'https://www.newspapers.com', 'https://www.rentcafe.com', 'https://www.babylonbee.com', 'https://www.fleetfarm.com', 'https://www.jegs.com', 'https://www.appurse.com', 'https://www.123-movies.com', 'https://www.camelcamelcamel.com', 'https://www.muscleandstrength.com' ]

And the sites with Access Denied were:

AccessDeniedError: [ 'https://www.sprint.com', 'https://www.kroger.com', 'https://www.petsmart.com', 'https://www.tacobell.com', 'https://www.subway.com', 'https://www.zoosk.com', 'https://www.officedepot.com', 'https://www.hotwire.com', 'https://www.meijer.com', 'https://www.jcrew.com', 'https://www.backcountry.com', 'https://www.littlecaesars.com', 'https://www.fisglobal.com', 'https://www.fossil.com', 'https://www.flyertalk.com', 'https://www.citizensbank.com', 'https://www.earthlink.net', 'https://www.apartmentfinder.com', 'https://www.demandforce.com', 'https://www.pizzahut.com', 'https://www.t-mobile.com' ]
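For reference, a minimal sketch of what the title-based classification described above could look like with Node.js and selenium-webdriver. The function name and return values are illustrative, not the crawler's actual code:

```js
// Minimal sketch (not the crawler's actual code) of title-based detection of
// blocked pages. `driver` is a selenium-webdriver WebDriver instance that has
// already loaded the site.
async function classifyBlockedPage(driver) {
  const title = await driver.getTitle();
  if (title === 'Just a moment...') {
    // Cloudflare's "Verify you are a human" interstitial uses this title
    return 'VerifyHumanError';
  }
  if (title.includes('Access Denied')) {
    return 'AccessDeniedError';
  }
  return null; // the title gives no indication that the page is blocked
}
```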

OliverWang13 commented 1 year ago

I've put some thought into this too and we could search for text like "Access Denied" or "Verify you are a human" by using a method similar to how we search for Do Not Sell links. Also, while we have access to @sophieeng's IP address, we could have her run a crawl and see if sites are still blocked.
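A rough sketch of that text-search idea (the phrases and the lookup are assumptions, not the existing Do Not Sell link code):

```js
// Sketch only: scan the rendered page body for block-page phrases, similar in
// spirit to how the crawler searches page text for Do Not Sell links.
const { By } = require('selenium-webdriver');

// Hypothetical phrase list; new variants could be appended as they are found.
const BLOCK_PHRASES = [/access denied/i, /verify you are a human/i];

async function pageLooksBlocked(driver) {
  const body = await driver.findElement(By.tagName('body'));
  const text = await body.getText();
  return BLOCK_PHRASES.some((re) => re.test(text));
}
```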

katehausladen commented 1 year ago

some more examples:

ask petco zoominfo
katehausladen commented 1 year ago

I wrote some code to click the Cloudflare "Verify you are a human" button (as in the cinemark.com example). When the button is clicked, the same captcha page loads again, so clicking the button alone does not work; Cloudflare is evidently detecting that we are using Selenium. I haven't found any packages that would help with this for Selenium with Node.js and Firefox, and since it's only impacting ~20 sites, I'm not sure it's worth spending a ton of time trying to bypass. We can discuss this more at the meeting.
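For context, the attempt looked roughly like the sketch below. The iframe and checkbox selectors are guesses at Cloudflare's challenge markup and may not match the real page; as noted above, even a successful click just reloads the same challenge:

```js
// Rough sketch of the "click the checkbox" attempt. Selectors are assumptions
// about Cloudflare's challenge widget, which is rendered inside an iframe.
const { By, until } = require('selenium-webdriver');

async function tryClickCloudflareCheckbox(driver) {
  // Wait for the challenge iframe, then switch into it.
  const frame = await driver.wait(
    until.elementLocated(By.css('iframe[title*="challenge"]')), 10000);
  await driver.switchTo().frame(frame);
  // Click the verification checkbox inside the iframe.
  await driver.findElement(By.css('input[type="checkbox"]')).click();
  await driver.switchTo().defaultContent();
}
```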

SebastianZimmeck commented 1 year ago

> The crawler won't restart after these errors, so it doesn't affect the overall flow.

The crawler continues crawling (as a clarification).

> impacting ~20 sites

Out of the roughly 1,800 sites in the crawl, though, those ~20 are specifically the Cloudflare human checks.

katehausladen commented 1 year ago

We decided in the last meeting that we will not try to bypass any human checks. We will add new titles to the regex in the visit_site function in local-crawler.js as we see more sites like this. The README has also been updated to document this in Section 4.
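For illustration, a title regex along these lines; this is hypothetical, and the real pattern in visit_site in local-crawler.js may differ:

```js
// Hypothetical sketch of an extensible blocked-page title check. New title
// alternatives can be appended to the pattern as more variants are found.
const blockedTitleRegex = /Access Denied|Just a moment\.\.\./i;

async function titleIndicatesBlockedPage(driver) {
  const title = await driver.getTitle();
  return blockedTitleRegex.test(title);
}
```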