peter-evans / link-checker

A GitHub action for link checking repository Markdown and HTML files
MIT License

Workaround for Too Many Requests (HTTP error 429) #29

Open mitm001 opened 4 years ago

mitm001 commented 4 years ago

I use Antora to build my doc site. When it builds pages, it adds the same table of contents and header to every page, and every link in them has the same class, nav-link. As my site has over 250 pages, this means there are thousands of these duplicated links.

Sites start throttling after so many hits, so I get thousands of these "Too Many Requests (HTTP error 429)" errors with the default of 512 concurrent HTTP requests. I reduced this to 32 to slow things down, which brings the errors down to the hundreds.

I use a regex to skip the header links that are never going to change, but the ones in the TOC are always changing.

Are there any other configuration options I could take advantage of to reduce these errors from the TOC? Like maybe skipping links based on the class of the anchor element?

peter-evans commented 4 years ago

Hi @mitm001 I'm afraid I don't know of any configurations that could help you. As you know, this action is a simple wrapper around Liche. It would probably be best to raise this issue there instead. Perhaps it's related to this issue https://github.com/raviqqe/liche/issues/37.

mitm001 commented 4 years ago

Yep, that's what I will do.

ionut-arm commented 3 years ago

Hi,

We're using your GitHub Action for our documentation as well (thanks!) and have started seeing this problem with github.io links. Looking at the Liche repo, I noticed a CLI option:

-c, --concurrency <num-requests>  Set max number of concurrent HTTP requests. [default: 512]

If that were configurable through the GitHub Action YAML, it could probably help with the Too Many Requests error. Sure, if you have thousands of links to check you'll end up with a long run, but that's kinda what rate limiting is looking to do...

I'm not sure if issue #37 applies to us, because rate limiting on GitHub's side sounds deterministic, while our errors are not.

peter-evans commented 3 years ago

Hi @ionut-arm

Good point. Liche arguments are configurable via the args input, so I think the following example should work. I don't know what a suitable number of concurrent requests would be to avoid this issue, though. That would just require some experimentation.

    - name: Link Checker
      uses: peter-evans/link-checker@v1
      with:
        args: -v -r -c 48 *

MichaIng commented 3 years ago

Even with a concurrency of 1 it fails, as it seems to be not (only) about the number of concurrent connections but about the number of connections within a specific time range: https://github.com/raviqqe/liche/issues/42 Probably due to keep-alive requests. Basically, it would require adding a delay before checking the same host again 🤔.

peter-evans commented 3 years ago

Liche was recently deprecated and as a result I've also decided to deprecate this action in favour of lychee-action, which is a fork of this project based on lychee. Please consider using that action.

According to the readme:

For GitHub links, it can optionally use a GITHUB_TOKEN to avoid getting blocked by the rate limiter.
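
A minimal sketch of what that might look like in a workflow. The action reference lycheeverse/lychee-action@v1 and passing the token through the GITHUB_TOKEN environment variable are assumptions on my part, so check the lychee-action readme for the currently supported form:

    - name: Link Checker
      # Assumed action reference; pin whichever version the readme recommends
      uses: lycheeverse/lychee-action@v1
      env:
        # Assumed: the token is read from this environment variable
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

This only helps with rate limiting on GitHub links. Other hosts that return 429 may still need fewer or slower requests.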