nextstrain / .github

docs-ci: `make linkcheck` prone to transient network failures #106

Open victorlin opened 2 weeks ago

victorlin commented 2 weeks ago

I've just run into this error on an Augur PR which did not change any docs links:

(api/developer/augur.merge: line    7) broken    https://www.gnu.org/software/bash/manual/bash.html#ANSI_002dC-Quoting - HTTPSConnectionPool(host='www.gnu.org', port=443): Max retries exceeded with url: /software/bash/manual/bash.html (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f4835cbc1c0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
build finished with problems.
make: *** [Makefile:20: linkcheck] Error 1

This looks like a transient network error, but it shows up as a failing check ❌ on the PR, which confused me at first. `make linkcheck` is a recent addition (#104), so it's hard to tell how often we will run into this. If it happens often, it might be worth splitting linkcheck into a separate job in docs-ci and setting `continue-on-error: true`.

genehack commented 2 weeks ago

> This seems like a transient network error

FWIW, I did see these types of errors occasionally (locally) while I was working on correcting links across the various repos.

Thanks for creating the issue; if this happens frequently, I'll handle the split/continue-on-error changes.

victorlin commented 2 weeks ago

Documenting another occurrence:

(installation/installation: line    9) broken    http://www.microbesonline.org/fasttree/ - 403 Client Error: Forbidden for url: http://www.microbesonline.org/fasttree/
(releases/changelog: line  646) broken    https://github.com/nextstrain/augur/pull/1033 - 502 Server Error: Bad Gateway for url: https://github.com/nextstrain/augur/pull/1033
(releases/changelog: line  642) broken    https://github.com/nextstrain/augur/pull/1034 - 502 Server Error: Bad Gateway for url: https://github.com/nextstrain/augur/pull/1034
(releases/changelog: line  626) ok        https://github.com/nextstrain/augur/pull/1070
(releases/changelog: line  598) broken    https://github.com/nextstrain/augur/pull/1039 - 502 Server Error: Bad Gateway for url: https://github.com/nextstrain/augur/pull/1039
(releases/changelog: line  643) broken    https://github.com/nextstrain/augur/pull/1042 - 502 Server Error: Bad Gateway for url: https://github.com/nextstrain/augur/pull/1042

tsibley commented 2 weeks ago

And another, twice in a row.

genehack commented 2 weeks ago

> And another, twice in a row.

BOOOOOO.

I will pick this up and set `continue-on-error: true` in the next work cycle.

victorlin commented 1 week ago

I'm wondering if `continue-on-error: true` is the right solution here. With this setting as-is, "real" linkcheck issues are likely to go unnoticed.

On the other hand, with the CI failures we have currently, or with something like mainmatter/continue-on-error-comment, I'm worried it could be unnecessarily noisy given the high rate of these failures lately (I've seen many in Augur, but I'm no longer linking them back here).
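One way to keep "real" failures visible without the noise might be to sort the broken links by failure reason. The following is only a rough sketch: it parses log lines in the format quoted earlier in this thread, and the `TRANSIENT_MARKERS` list and the `classify` helper are hypothetical names and heuristics, not part of Sphinx or docs-ci.

```python
import re

# Matches "broken" lines in the linkcheck log format quoted in this thread.
BROKEN_LINE = re.compile(
    r"^\((?P<doc>[^:]+): line\s+(?P<line>\d+)\)\s+broken\s+"
    r"(?P<url>\S+) - (?P<reason>.*)$"
)

# Assumption: these substrings indicate a failure that is probably transient.
TRANSIENT_MARKERS = (
    "Max retries exceeded",
    "Network is unreachable",
    "Server Error",  # any 5xx response
    "timed out",
)


def classify(log_text):
    """Split broken links into (probably_transient, needs_attention) lists."""
    transient, persistent = [], []
    for raw in log_text.splitlines():
        match = BROKEN_LINE.match(raw.strip())
        if match is None:
            continue  # "ok" lines, build messages, and so on
        entry = (match["doc"], int(match["line"]), match["url"], match["reason"])
        if any(marker in match["reason"] for marker in TRANSIENT_MARKERS):
            transient.append(entry)
        else:
            persistent.append(entry)
    return transient, persistent
```

With the log excerpts above, the 502 responses from github.com and the unreachable www.gnu.org would land in the first list, while the 403 from www.microbesonline.org would land in the second and could still fail the check.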

Some alternatives:

  1. If the network failures are only on a few URLs/domains, configure linkcheck to ignore those domains
  2. Don't run linkcheck in CI but instead on a weekly schedule, with retries and a cooldown period between tries. This would reduce the impact of transient network failures while still making sure links are valid.
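For the first alternative, Sphinx's linkcheck builder can be tuned from `conf.py`. A minimal sketch follows; the option names are Sphinx's own, but the patterns and values are illustrative assumptions based on the hosts that failed in this thread, not settings taken from any Nextstrain repo.

```python
# conf.py (Sphinx): illustrative values only, not the project's actual settings.

# Regular expressions for URLs that linkcheck should skip entirely.
# Assumption: these two hosts are the flaky ones; adjust as failures are observed.
linkcheck_ignore = [
    r"^https://www\.gnu\.org/",
    r"^http://www\.microbesonline\.org/",
]

# Number of attempts per link before it is reported as broken.
linkcheck_retries = 3

# Seconds to wait for a response on each request.
linkcheck_timeout = 30
```

The trade-off is that an ignored domain is never checked, so a link there that really breaks would go unnoticed; raising `linkcheck_retries` keeps the check in place but makes runs slower.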

I realize this comment is coming a bit late, but it's longer-term thinking. `continue-on-error: true` should be good for reducing CI failures in the short term.
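For the scheduled-run alternative, the "retries + cooldown" part could be a small wrapper script around `make linkcheck`. This is a sketch under assumptions: `run_with_retries`, its defaults, and the injectable `run`/`sleep` parameters are hypothetical and exist only to make the idea concrete and testable.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, cooldown=300,
                     run=subprocess.run, sleep=time.sleep):
    """Run cmd up to `attempts` times, pausing `cooldown` seconds between tries.

    Returns True as soon as one run exits with status 0, otherwise False.
    """
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return True
        if attempt < attempts:
            sleep(cooldown)  # give flaky hosts time to recover before retrying
    return False
```

A scheduled job could then call `run_with_retries(["make", "linkcheck"])` and fail only when every attempt fails, which should filter out most one-off network errors while still surfacing links that stay broken.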

tsibley commented 1 week ago

I generally agree with @victorlin here.