ripple / dactyl

Tools to make enterprise documentation from Markdown sources.
MIT License

Link checker - back off when rate-limited, cache successful remote links across runs #60

Open mDuo13 opened 3 years ago

mDuo13 commented 3 years ago

Right now, running the link checker can fail if the docs have a whole lot of links to the same website, causing it to trigger that website's rate-limiting behavior.

When receiving a response code 429 Too Many Requests on a remote link, the link checker should back off from checking links to the same (sub-)domain and try them again later in the run if possible. For example, it might cache the link that got the first 429 result along with any links to the same domain for about 1 minute, then retry all the cached links after that period, backing off again if the 429 responses resume. The actual backoff behavior should follow industry best practices.

While waiting to retry rate-limited links, the checker should continue checking links to other domains.
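
To make the back-off behavior described above more concrete, here is a rough sketch of a per-domain back-off tracker. Everything in it is an assumption for illustration: the `DomainBackoff` class, its method names, and the delay values are hypothetical and not part of Dactyl. It uses exponential back-off (doubling the delay on each consecutive 429, starting at 1 minute) as one plausible reading of "industry best practices", and it optionally honors a `Retry-After` value given in seconds.

```python
import time
from urllib.parse import urlparse


class DomainBackoff:
    """Hypothetical helper: tracks cool-down periods per (sub-)domain."""

    def __init__(self, base_delay=60.0, max_delay=960.0, clock=time.monotonic):
        self.base_delay = base_delay  # first cool-down, about 1 minute
        self.max_delay = max_delay    # upper bound on any single cool-down
        self.clock = clock            # injectable so the logic can be tested
        self._strikes = {}            # domain -> consecutive 429 count
        self._retry_at = {}           # domain -> earliest time to retry

    @staticmethod
    def domain_of(url):
        # The network location includes the subdomain, so docs.example.com
        # and api.example.com are rate-limited independently.
        return urlparse(url).netloc.lower()

    def record_429(self, url, retry_after=None):
        """Register a 429 for the URL's domain; return the chosen delay."""
        domain = self.domain_of(url)
        strikes = self._strikes.get(domain, 0) + 1
        self._strikes[domain] = strikes
        # Double the delay on each consecutive 429, capped at max_delay.
        delay = min(self.base_delay * 2 ** (strikes - 1), self.max_delay)
        if retry_after is not None:
            # Assumes Retry-After was already parsed into seconds.
            delay = max(delay, float(retry_after))
        self._retry_at[domain] = self.clock() + delay
        return delay

    def record_success(self, url):
        """A successful check resets the back-off state for that domain."""
        domain = self.domain_of(url)
        self._strikes.pop(domain, None)
        self._retry_at.pop(domain, None)

    def is_blocked(self, url):
        """True while the URL's domain is still cooling down."""
        retry_at = self._retry_at.get(self.domain_of(url), float("-inf"))
        return self.clock() < retry_at


def split_ready(urls, backoff):
    """Partition URLs into those checkable now and those to defer."""
    ready = [u for u in urls if not backoff.is_blocked(u)]
    deferred = [u for u in urls if backoff.is_blocked(u)]
    return ready, deferred
```

The checker's main loop could call `split_ready` on its queue of remote links, check the ready ones, and come back to the deferred ones once their domain's cool-down has expired, so links to other domains keep being checked in the meantime.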

Furthermore, to speed up most runs and reduce the chances of being rate-limited in the future, the link checker should be able to load a cache of successfully checked links, with the timestamp of the previous check for each link. The link checker should automatically discard results that are older than a threshold, but keep the other results so that it doesn't have to check those links again during the current run. The suggested threshold for successfully checked links is 7 days, but it should be configurable in the dactyl config file. Failed results should not be cached, so they are retried every run unless added to the (already existing) known broken links setting in the config.

The link checker should also be able to save the results of a run to such a file.
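
As a starting point for the cache described above, here is a minimal sketch of loading and saving successful results. The JSON file format (a flat mapping of URL to Unix timestamp), the function names, and the `max_age` parameter are all assumptions made for illustration; none of them exist in Dactyl today, and the real config option name is still to be decided.

```python
import json
import time

# 7 days in seconds: the threshold suggested above (assumed default).
DEFAULT_MAX_AGE = 7 * 24 * 60 * 60


def load_link_cache(path, max_age=DEFAULT_MAX_AGE, now=None):
    """Load cached successes, discarding entries older than max_age.

    Returns a dict mapping URL -> timestamp of the last successful check.
    A missing or unreadable cache file is treated as an empty cache.
    """
    now = time.time() if now is None else now
    try:
        with open(path, encoding="utf-8") as f:
            raw = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    if not isinstance(raw, dict):
        return {}
    return {
        url: checked_at
        for url, checked_at in raw.items()
        if isinstance(checked_at, (int, float)) and now - checked_at <= max_age
    }


def save_link_cache(path, cache):
    """Write the URL -> timestamp mapping back to disk.

    Only successful checks should ever be put in `cache`, so failed
    links are retried on every run.
    """
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cache, f, indent=2, sort_keys=True)
```

During a run, the checker could skip any remote link found in the loaded dict, add `cache[url] = time.time()` after each new success, and call `save_link_cache` at the end.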

Local links should not be subject to any of these behaviors.
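
One way to draw the line between local and remote links is to look at the parsed URL. This is only a sketch under the assumption that "remote" means an http(s) or protocol-relative link; `is_remote_link` is a hypothetical name, and the existing checker may already classify links differently.

```python
from urllib.parse import urlparse


def is_remote_link(href):
    """Return True for links that would go over the network.

    Relative paths, anchors, and non-HTTP schemes such as mailto: are
    treated as not remote, so back-off and caching never apply to them.
    """
    parsed = urlparse(href)
    if parsed.scheme in ("http", "https"):
        return True
    # Protocol-relative links like //cdn.example.com/x.js have no scheme
    # but do have a network location.
    return parsed.scheme == "" and bool(parsed.netloc)
```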

Bonus points: adapt the following GitHub Actions job to save & load the cached link checking results on subsequent runs.

name: Link Checker

on: [push]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      max-parallel: 4
      matrix:
        python-version: [3.8]

    steps:
    - uses: actions/checkout@v1
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v1
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install dactyl
    - name: Build repo
      run: |
        dactyl_build
    - name: Run Dactyl Link Checker
      run: |
        dactyl_link_checker