Closed adamchainz closed 3 weeks ago
@adamchainz can you provide some of those URL that are returning 403 so we have something to test against? Do those URL always return 403 or they are flaky?
Is there something I or Oh Dear can do to pass the checks?
Is it possible to reduce the amount of requests per minute or similar as a test?
@ericholscher on Sep 26, we swapped to the new web-ext-theme ASG. Do you think it could be related somehow? 🤔
Also note that another user reported a similar issue some weeks ago, https://github.com/readthedocs/readthedocs.org/issues/11615, but I wasn't able to find a problem on our side.
@adamchainz can you provide some of those URL that are returning 403 so we have something to test against? Do those URL always return 403 or they are flaky?
Sure, here's the full list:
broken-links-adamj.eu-20241001023746-37704017990.csv
Is it possible to reduce the amount of requests per minute or similar as a test?
Unfortunately, they don't support any control for this. I believe they are very respectful, though.
I did a quick test with those URLs, and all of them give me 200 or 302 with a simple curl -ILs. If Oh Dear were hitting a rate limit, it should be a 429 instead of a 403, though. I'm not sure what's happening here.
I also checked the number of 403 status codes in CF, and I don't see that they increased after Sep 26, when we performed the ASG change:
Interesting... Starting on Tuesday, Sep 24, all OhDear traffic started to return 403 for some reason.
I tried running the following command with 8 processes in parallel:
for url in `cat urls.issue.11630.txt`; do curl -sIL -A "Mozilla/5.0 (compatible; OhDear/1.1; +https://ohdear.app/checker; brokenLinks)" $url | grep HTTP; done
and none of my requests were blocked. All of them returned 200/302 status codes 🤷🏼
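For anyone who wants to repeat this kind of check, here is a rough sketch of the same test that prints a count per final status code instead of raw header lines. The function names `final_status_codes` and `tally` are made up for illustration; the User-Agent string and the `urls.issue.11630.txt` file name are the ones from the command above. It assumes an `xargs` that supports `-P` (GNU and BSD both do) and may not reproduce whatever other headers Oh Dear actually sends.

```shell
#!/usr/bin/env bash
# User-Agent copied from the command in this thread.
UA='Mozilla/5.0 (compatible; OhDear/1.1; +https://ohdear.app/checker; brokenLinks)'

# Hypothetical helper: print the final HTTP status code (after redirects)
# for every URL in the file given as $1, using up to 8 parallel curl processes.
final_status_codes() {
  xargs -P 8 -n 1 curl -s -o /dev/null -I -L -A "$UA" -w '%{http_code}\n' < "$1"
}

# Hypothetical helper: read one status code per line on stdin and print
# "<count> <code>" for each distinct code.
tally() {
  sort | uniq -c | awk '{print $1, $2}'
}

# Usage (hits the network):
#   final_status_codes urls.issue.11630.txt | tally
```

A run where everything is blocked would show a single line for 403, while a healthy run should show only 200-range codes, since `-L` follows the 302 redirects before the code is reported.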
Thanks for looking and sharing @humitos. I wonder if some other header from OhDear is tripping CloudFlare bot protection or something.
@mattiasgeniar, would you be able to advise? Maybe something changed in OhDear. If tagging on GitHub isn't appropriate, I can submit a support ticket.
This is getting blocked by the CF managed rules for AI bots, it seems:
Which is odd, since it's in their own list as a Monitoring tool:
I updated our managed rule, which I don't think will change anything, but will be good to test again. Can see the results here (CF dashboard).
I've added an explicit exemption for Oh Dear as well, so that should hopefully fix it.
I ran a new check and it came back with 403s again 😞
Okay, after another run, everything worked. Thank you @ericholscher !
Wohoo! Thanks for the feedback. I'm going to close this issue as solved now, but feel free to re-open/comment if it stops working for any reason.
Details
I use a link checker tool called Oh Dear. It checks all links on my blog, of which many are links to Read-the-Docs-hosted sites. This checker has given me many valuable fixes over the years, letting me fix links across some major project docs changes.
On 27 Sep, it failed with 147 new errors, all 403 responses from RtD projects:
I'm guessing you or a provider enabled some new blocking rule that affects Oh Dear, probably to combat AI crawlers per your June blog post. Would it be possible to undo it? Is there something I or Oh Dear can do to pass the checks?
Expected Result
Link checks work.
Actual Result
403 responses.