readthedocs / readthedocs.org

The source code that powers readthedocs.org
https://readthedocs.org/
MIT License

Link checker bot (Oh Dear) now blocked #11630

Closed · adamchainz closed 3 weeks ago

adamchainz commented 3 weeks ago


I use a link checker tool called Oh Dear. It checks all links on my blog, many of which point to Read the Docs-hosted sites. This checker has given me many valuable fixes over the years, letting me repair links after some major projects reorganized their docs.

On 27 Sep, it failed with 147 new errors, all 403 responses from RtD projects:

(Screenshot, 2024-09-30: Oh Dear report listing the new errors, all 403s)

I'm guessing you or a provider enabled some new blocking rule that affects Oh Dear, probably to combat AI crawlers per your June blog post. Would it be possible to undo it? Is there something I or Oh Dear can do to pass the checks?

Expected Result

Link checks work.

Actual Result

403 responses.

humitos commented 3 weeks ago

@adamchainz can you provide some of those URLs that are returning 403 so we have something to test against? Do those URLs always return 403, or are they flaky?

Is there something I or Oh Dear can do to pass the checks?

Is it possible to reduce the number of requests per minute, or something similar, as a test?

humitos commented 3 weeks ago

@ericholscher on Sep 26, we swapped to the new web-ext-theme ASG. Do you think it could be related somehow? 🤔

Also note that another user reported a similar issue some weeks ago, https://github.com/readthedocs/readthedocs.org/issues/11615, but I wasn't able to find an issue on our side.

adamchainz commented 3 weeks ago

@adamchainz can you provide some of those URLs that are returning 403 so we have something to test against? Do those URLs always return 403, or are they flaky?

Sure, here's the full list:

broken-links-adamj.eu-20241001023746-37704017990.csv

Is it possible to reduce the number of requests per minute, or something similar, as a test?

Unfortunately, they don't offer any control over this. I believe they are very respectful, though.

humitos commented 3 weeks ago

I did a quick test with those URLs and all of them give me 200 or 302 with a simple curl -ILs. If Oh Dear were hitting a rate limit, it should get a 429 rather than a 403, though. I'm not sure what's happening here.
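For reference, a minimal sketch of that kind of spot check, assuming the URLs from the CSV above sit in its first column (the real export layout may differ, so adjust the cut field):

```sh
# Extract and de-duplicate the URL column from the Oh Dear export, then
# print the final HTTP status for each URL. A 403 confirms the block,
# whereas a 429 would instead point at rate limiting.
cut -d, -f1 broken-links-adamj.eu-20241001023746-37704017990.csv | sort -u |
while read -r url; do
  code=$(curl -ILs -o /dev/null -w '%{http_code}' "$url")
  echo "$code $url"
done
```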

I also checked the number of 403 status codes in CF, and I don't see that they increased after Sep 26, when we performed the ASG change:

(Screenshot, 2024-10-01: CF dashboard graph of 403 responses around the Sep 26 ASG change)

humitos commented 3 weeks ago

Interesting... Starting on Tuesday, Sep 24, all OhDear traffic started to return 403 for some reason.

(Screenshot, 2024-10-01: CF dashboard showing all Oh Dear traffic returning 403 starting Tuesday, Sep 24)

humitos commented 3 weeks ago

I tried running the following command with 8 processes in parallel:

```sh
for url in $(cat urls.issue.11630.txt); do
  curl -sIL -A "Mozilla/5.0 (compatible; OhDear/1.1; +https://ohdear.app/checker; brokenLinks)" "$url" | grep HTTP
done
```

and none of my requests were blocked; all of them returned 200/302 status codes 🤷🏼
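For completeness, one way to actually fan that loop out to 8 parallel workers is xargs; a sketch, reusing the same URL list and user agent as above:

```sh
# Run up to 8 curls concurrently, replaying Oh Dear's user agent, and keep
# only the status lines (output from parallel jobs may interleave).
xargs -P 8 -I {} \
  curl -sIL -A "Mozilla/5.0 (compatible; OhDear/1.1; +https://ohdear.app/checker; brokenLinks)" {} \
  < urls.issue.11630.txt | grep HTTP
```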

adamchainz commented 3 weeks ago

Thanks for looking into this and sharing, @humitos. I wonder if some other header from Oh Dear is tripping Cloudflare's bot protection or something.

@mattiasgeniar, would you be able to advise? Maybe something changed in OhDear. If tagging on GitHub isn't appropriate, I can submit a support ticket.

ericholscher commented 3 weeks ago

This is getting blocked by the CF managed rules for AI bots, it seems:

(Screenshot, 2024-10-01: CF firewall events showing the AI bots managed rule blocking the Oh Dear requests)

Which is odd, since it's in their own list as a Monitoring tool:

(Screenshot, 2024-10-01: Cloudflare's verified bots list, with Oh Dear categorized as a Monitoring tool)

ericholscher commented 3 weeks ago

I updated our managed rule, which I don't think will change anything, but it will be good to test again. You can see the results here (CF dashboard).

ericholscher commented 3 weeks ago

I've added an explicit exemption for Oh Dear as well, so that should hopefully fix it.
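If it helps with testing: since a plain curl replaying the user agent already passed earlier in this thread, the definitive check is a fresh Oh Dear run, but a smoke test along these lines can at least show whether any remaining 403 is coming from the Cloudflare edge rather than the origin (the URL is a placeholder; substitute one from the CSV above):

```sh
# A 403 accompanied by a cf-ray header and a "server: cloudflare" banner
# points at an edge/WAF rule rather than the Read the Docs origin.
curl -sIL \
  -A "Mozilla/5.0 (compatible; OhDear/1.1; +https://ohdear.app/checker; brokenLinks)" \
  "https://example-project.readthedocs.io/en/latest/" \
  | grep -iE '^(HTTP|server|cf-ray)'
```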

adamchainz commented 3 weeks ago

I ran a new check and it came back with 403s again 😞

adamchainz commented 3 weeks ago

Okay, after another run, everything worked. Thank you @ericholscher!

humitos commented 3 weeks ago

Woohoo! Thanks for the feedback. I'm going to close this issue as solved now, but feel free to re-open or comment if it stops working for any reason.