Closed katehausladen closed 1 year ago
I've put some thought into this too. We could search for text like "Access Denied" or "Verify you are a human" using a method similar to how we search for Do Not Sell links. Also, while we have access to @sophieeng's IP address, we could have her run a crawl and see if the sites are still blocked.
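As a rough illustration of the text-search idea, here is a minimal sketch that scans page text for block-page phrases. The names `BLOCK_PHRASES` and `findBlockPhrase` are hypothetical and not taken from the crawler, and the patterns cover only the two phrases mentioned above.

```javascript
// Hypothetical phrase list; extend it as new block pages show up.
const BLOCK_PHRASES = [
  { label: "AccessDenied", pattern: /access\s+denied/i },
  { label: "VerifyHuman", pattern: /verify\s+(that\s+)?you\s+are\s+(a\s+)?human/i },
];

// Returns the label of the first phrase found in the text, or null if none match.
function findBlockPhrase(pageText) {
  for (const { label, pattern } of BLOCK_PHRASES) {
    if (pattern.test(pageText)) {
      return label;
    }
  }
  return null;
}

// With selenium-webdriver, the text could come from something like:
//   const pageText = await driver.findElement(By.css("body")).getText();
console.log(findBlockPhrase("Access Denied. You don't have permission."));
console.log(findBlockPhrase("Verify you are a human to continue"));
console.log(findBlockPhrase("Welcome to our store"));
```

A body-text search like this could catch block pages whose title is generic, at the cost of occasional false positives on pages that merely mention these phrases.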
Some more examples:
I wrote some code to click Cloudflare's "Verify you are a human" button (as in the cinemark.com example). When the button is clicked, the same captcha page loads again. Since clicking the button alone does not work, Cloudflare is most likely detecting that we are using Selenium. I haven't found any packages that would help with this for Selenium with Node.js and Firefox, and since it's only impacting ~20 sites, I'm not sure it's worth spending a lot of time trying to bypass it. We can discuss this more at the meeting.
Regarding "The crawler won't restart after these errors, so it doesn't affect the overall flow": to clarify, the crawler continues crawling.
Regarding "impacting ~20 sites": that is out of about 1,800 sites in total (roughly 1%), and the 20 are specifically the Cloudflare human checks.
We decided in the last meeting that we will not try to bypass any human checks. We will add new titles to the regex in the visit_site function in local-crawler.js as we see more sites like this. The readme is also updated to include this in section 4.
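For reference, a title check of this kind might look like the sketch below. It is not the actual code in `visit_site`; the function name `classifyTitle` and both patterns are assumptions based on the two titles mentioned in this thread ("Access Denied" and "Just a moment...").

```javascript
// Assumed title patterns; new titles would be added as alternatives here.
const VERIFY_HUMAN_TITLE = /^just a moment/i;
const ACCESS_DENIED_TITLE = /access denied/i;

// Maps a page title to an error name, or null when the title looks normal.
function classifyTitle(title) {
  if (VERIFY_HUMAN_TITLE.test(title)) {
    return "VerifyHumanError";
  }
  if (ACCESS_DENIED_TITLE.test(title)) {
    return "AccessDeniedError";
  }
  return null;
}

console.log(classifyTitle("Just a moment..."));
console.log(classifyTitle("Access Denied"));
console.log(classifyTitle("Movie Times and Tickets"));
```

Adding a newly observed title would then mean extending one of the patterns, for example with an alternation such as `/access denied|forbidden/i` (the second alternative is only an illustration).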
I've noticed that some sites go to a page that says some variation of "Access Denied" or "Verify you are a human." I think this is mostly caused by the VPN (i.e. the VPN IP address is blocked/recognized), but the crawler probably contributes, too. There seems to be some variability in which sites are affected from run to run, particularly in the "Verify you are a human" category. We want to:
Currently, I am identifying them by the title of the site (i.e. "Access Denied" and "Just a moment..." in the pictures below).
The sites I do identify are logged as either a "VerifyHumanError" or an "AccessDeniedError". The crawler won't restart after these errors, so it doesn't affect the overall flow.

In the latest full crawl, the sites that asked to verify you were a human were:

VerifyHumanError: [ 'https://www.legacy.com', 'https://www.ticketsatwork.com', 'https://www.fixya.com', 'https://www.cameo.com', 'https://www.cardinalcommerce.com', 'https://www.cinemark.com', 'https://www.securecafe.com', 'https://www.lordandtaylor.com', 'https://www.moneytalksnews.com', 'https://www.allegiantair.com', 'https://www.newspapers.com', 'https://www.rentcafe.com', 'https://www.babylonbee.com', 'https://www.fleetfarm.com', 'https://www.jegs.com', 'https://www.appurse.com', 'https://www.123-movies.com', 'https://www.camelcamelcamel.com', 'https://www.muscleandstrength.com' ]

And the sites with Access Denied were:

AccessDeniedError: [ 'https://www.sprint.com', 'https://www.kroger.com', 'https://www.petsmart.com', 'https://www.tacobell.com', 'https://www.subway.com', 'https://www.zoosk.com', 'https://www.officedepot.com', 'https://www.hotwire.com', 'https://www.meijer.com', 'https://www.jcrew.com', 'https://www.backcountry.com', 'https://www.littlecaesars.com', 'https://www.fisglobal.com', 'https://www.fossil.com', 'https://www.flyertalk.com', 'https://www.citizensbank.com', 'https://www.earthlink.net', 'https://www.apartmentfinder.com', 'https://www.demandforce.com', 'https://www.pizzahut.com', 'https://www.t-mobile.com' ]
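The bookkeeping behind lists like these can be sketched as follows. This is only an illustration of grouping sites by error name while the crawl keeps going; `recordError`, `errorLog`, and the sample `results` array (including the example.com entry) are hypothetical and not the crawler's real data structures.

```javascript
// Appends a site to the list kept under the given error name.
function recordError(errorLog, errorName, site) {
  if (!errorLog[errorName]) {
    errorLog[errorName] = [];
  }
  errorLog[errorName].push(site);
}

// Hypothetical per-site outcomes: [url, error name or null if the page loaded].
const results = [
  ["https://www.cinemark.com", "VerifyHumanError"],
  ["https://www.example.com", null],
  ["https://www.kroger.com", "AccessDeniedError"],
];

const errorLog = {};
for (const [site, errorName] of results) {
  if (errorName !== null) {
    recordError(errorLog, errorName, site);
  }
  // No restart or early exit: the loop simply moves on to the next site.
}

console.log(errorLog);
```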