Really cool @sbrl. I thought of this idea just today!
https://mailutils.org/manual/html_chapter/Programs.html#frm-and-from — @sahilister found this just 27 days ago!
A lot of those links are false positives actually, because you have curl's -I flag, which sets the method to HEAD, and that causes 404s/500s on sites like https://crates.io/ and https://redis.io.
Oh really? That's very unusual. According to the HTTP spec, HEAD should return exactly the same HTTP headers as a GET, just without the body, IIRC.
We can remove links that actually work from the list.
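For anyone who wants to test that claim on a specific site, here's a quick way to compare what a server actually returns for HEAD vs GET (the URL is just an example):

```bash
# Compare the response headers for HEAD vs GET on the same URL.
# -D - dumps the received headers to stdout; -o /dev/null discards any body.
url="https://crates.io/"   # example URL

echo "== HEAD =="; curl -sS -o /dev/null -D - -I "${url}"
echo "== GET  =="; curl -sS -o /dev/null -D -    "${url}"
```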
Updated my comment with a revised list
I see that. Looks good.
@sbrl, the description reads:
Broken links are defined as having a HTTP status code lower than 200, or greater than or equal to 400
But these have 3xx (redirect) status codes:
https://nixos.org/releases/nix/latest/manual/#sec-nix-collect-garbage
https://nixos.org/releases/nix/latest/manual#sec-nix-build
https://docs.microsoft.com/windows-server/administration/windows-commands/reg-flags
That is very cool 👍
We should add this to tldr-lint
🤔
But these have 3xx (redirect) status codes:
That's odd. I get an error when I load those links:
That is very cool 👍
Thanks!
We should add this to tldr-lint 🤔
That'd be cool, but also annoying when tldr-lint randomly fails because of a link that just went dead. It's best, I think, if it's run periodically.
That's odd. I get an error when I load those links:
So do I...I don't know what happened! 🤷
That'd be cool, but also annoying when tldr-lint randomly fails because of a link that just went dead. It's best, I think, if it's run periodically.
Just for reference (because I just discovered this): https://github.com/tldr-pages/tldr/pull/2684#pullrequestreview-195338362
Here's the one-liner:
```bash
find . -type f -iname '*.md' -print0 | xargs -0 cat | awk '/> More information/ { match($0, /<(.*)>/, arr); print(arr[1]); }' | sort | uniq | shuf | xargs -n1 -I{} bash -c 'url="{}"; code="$(curl --user-agent "curl; bash; xargs; tldr-pages-bad-url-checker (+https://github.com/tldr-pages/tldr; implemented by @sbrl)" -sSL -o /dev/null -w "%{http_code}" --fail -I "${url}")"; echo "${code} ${url}" >&2; if [[ "${code}" -lt 200 ]] || [[ "${code}" -ge 400 ]]; then echo "${url}"; fi' >/tmp/bad-urls.txt;
```
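For anyone who finds the one-liner hard to read, here's roughly the same pipeline unrolled into a script. This is a sketch, not a drop-in replacement: the output path and the shortened user agent are placeholders, and the awk stage still needs GNU awk for the three-argument match().

```bash
#!/usr/bin/env bash
# Rough multi-line equivalent of the one-liner above (sketch only).
# Needs GNU awk (three-argument match()), plus find, curl, and shuf.
set -uo pipefail

output="/tmp/bad-urls.txt"                                              # where to write the dead links
ua="tldr-pages-bad-url-checker (+https://github.com/tldr-pages/tldr)"   # placeholder user agent

# Collect every "More information" URL, deduplicate, and shuffle
# (shuffling spreads consecutive requests across different domains).
find . -type f -iname '*.md' -print0 \
    | xargs -0 cat \
    | awk '/> More information/ { match($0, /<(.*)>/, arr); print arr[1]; }' \
    | sort -u \
    | shuf \
    | while read -r url; do
        # HEAD request; connection failures leave the code as 000.
        code="$(curl --user-agent "${ua}" -sSL -o /dev/null -w '%{http_code}' --fail -I "${url}" || true)"
        echo "${code} ${url}" >&2
        # Dead: anything below 200 or at/above 400.
        if [[ "${code}" -lt 200 || "${code}" -ge 400 ]]; then
            echo "${url}"
        fi
    done > "${output}"
```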
@sbrl That's some lovely one-liner ❤️ I especially like your user agent 💯
@sbrl https://gist.github.com/vladimyr/d7bb314b14a59ca5fbefda3498fac81a 😉
Sneak peek (didn't have time to run it fully 🙃):
```
❯ tail -f bad-links.txt
common/autossh.md https://harding.motd.ca/autossh 0000
common/blender.md https://docs.blender.org/manual/en/latest/render/workflows/command_line.html 404
common/bw.md https://help.bitwarden.com/article/cli/ 500
common/cargo.md https://crates.io/ 404
common/clementine.md https://www.clementine-player.org 405
common/deemix.md https://deemix.app 0000
common/dexdump.md https://manpages.ubuntu.com/manpages/latest/en/man1/dexdump.1.html 404
common/glab.md https://clementsam.tech/glab/ 404
common/mutagen.md https://mutagen.io 405
common/mytop.md http://www.mysqlfanboy.com/mytop-3 0000
common/parquet-tools.md https://github.com/apache/parquet-mr/tree/master/parquet-tools 404
common/redis-cli.md https://redis.io/topics/rediscli 500
common/redis-server.md https://redis.io 500
common/redshift.md https://jonls.dk/redshift 0000
```
A lot of those links are false positives actually, because you have curl's -I flag, which sets the method to HEAD, and that causes 404s/500s on sites like crates.io and redis.io.
I don't know about Redis, but crates.io won't let you scrape their content. You'll get the same response even with a regular GET because they maintain a UA allowlist on their end.
@vladimyr Nice one! Don't forget to sleep in between each call, though. Many of the links are GitHub, and you'll get rate-limited if you hit GitHub too fast. Alternatively, you could group the URLs by domain and do each group of URLs in parallel to go real quick, but to do that I'd probably write it in Node.js instead, with phin and async/await.
Edit: Also maybe a HEAD request would reduce remote load too?
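A rough sketch of that group-by-domain idea in shell (check-urls.sh here is a hypothetical helper that loops over one file of URLs with a curl call and a short sleep):

```bash
# Group URLs by host so each host is checked sequentially (and politely),
# while different hosts are checked in parallel.
# Assumes urls.txt holds one URL per line (e.g. the output of the awk stage above).

# Prefix each URL with its host, sort, then write one group-<host>.txt file per host.
awk -F/ '{ print $3 "\t" $0 }' urls.txt \
    | sort \
    | awk -F'\t' '{ print $2 > ("group-" $1 ".txt") }'

# check-urls.sh (hypothetical) would curl each URL in the given file, sleeping between calls.
printf '%s\n' group-*.txt | xargs -P "$(nproc)" -n1 ./check-urls.sh
```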
I don't know about Redis, but crates.io won't let you scrape their content. You'll get the same response even with a regular GET because they maintain a UA allowlist on their end.
Yeah, @sbrl it would probably be better to use a standard UA like
Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36
Though yours is nice, it wouldn't work for some sites, as @vladimyr mentioned.
@vladimyr Nice one! Don't forget to sleep in between each call, though. Many of the links are GitHub, and you'll get rate-limited if you hit GitHub too fast.
I like living on the edge 😂
Alternatively, you could group the URLs by domain and do each group of URLs in parallel to go real quick, but to do that I'd probably write it in Node.js instead, with phin and async/await.
I was pretty sure I was aware of all the Node HTTP clients, but TIL about phin. Thank you! Yeah, there are many ways to make it faster; I just wanted to unroll it into a proper script as a starting point.
Edit: Also maybe a HEAD request would reduce remote load too?
Am I not doing that already? First a HEAD, and then fall back to GET on receiving a 405 (Method Not Allowed).
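In curl terms, that fallback looks roughly like this (a sketch of the idea, not the actual code from the gist):

```bash
url="https://example.com/"   # placeholder URL

# Try HEAD first (cheaper for the remote side)...
code="$(curl -sSL -o /dev/null -w '%{http_code}' -I "${url}")"
if [[ "${code}" -eq 405 ]]; then
    # ...and fall back to a normal GET if the server rejects HEAD outright.
    code="$(curl -sSL -o /dev/null -w '%{http_code}' "${url}")"
fi
echo "${code} ${url}"
```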
Yeah, @sbrl it would probably be better to use a standard UA like
Good idea... but that feels a bit dishonest, declaring that you're someone you're not? It doesn't quite sit right with me.
Added the Bash one-liner to the wiki here: https://github.com/tldr-pages/tldr/wiki/Useful-scripts-and-programs#detect-broken-more-information-links
So I'm closing this issue for now, but we can continue the discussion here if needed.
I was inspired by #5111, and I developed another of my Bash one-liners to detect broken More information links in tldr pages. Broken links are defined as having an HTTP status code lower than 200, or greater than or equal to 400. Connection issues (e.g. DNS resolution failures, TLS certificate misconfigurations, etc.) are also considered dead links.
Here's the one-liner:
cd to the root of a previously-cloned version of the tldr-pages repo, and then run the above command. Replace /tmp/bad-urls.txt with the path to the file you'd like to populate with the dead links. Sadly it doesn't tell you which page they are on, but this could be rectified with some refactoring.

Requirements:
Note that while xargs is used to (indirectly) drive curl here, I haven't added a -P "$(nproc)" because otherwise you'll end up hitting GitHub's rate limit very quickly.

Anyway, here's the current list of dead links I've found with this:
It seems as though 2 of these are in fact not dead according to Firefox, but it would appear that they have HTTPS misconfigurations that caused curl to fail (i.e. they don't send the intermediate certificate along with the actual certificate). Specifically:
...I've emailed the webmaster for mailutils.org about this, but I can't find a suitable contact address for deluge-torrent.org.

The other URLs are a mix of 404s, no-route-to-host errors, other more serious HTTPS misconfigurations, and DNS resolution errors.
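If anyone wants to confirm the incomplete-chain theory for themselves, openssl shows exactly which certificates a server sends (mailutils.org used here only because it's one of the affected sites):

```bash
# Print the certificate chain the server actually presents.
# A misconfigured server shows only the leaf certificate here (no intermediate),
# which is why curl fails while a browser with cached intermediates may still load the page.
openssl s_client -connect mailutils.org:443 -servername mailutils.org </dev/null 2>/dev/null \
    | sed -n '/^Certificate chain/,/^---$/p'
```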
We wouldn't want to run this as a GitHub action as random build failures would be annoying, but I just thought the exercise was interesting / useful to run from time to time :-)
The final list (these should be actually dead URLs):
In particular, http://postfix.org/ should be http://www.postfix.org/
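Quick way to double-check that particular pair (a plain GET here, to sidestep the HEAD caveat discussed above):

```bash
# Show the final status code and effective URL for each variant.
curl -sSL -o /dev/null -w '%{http_code} %{url_effective}\n' 'http://postfix.org/'
curl -sSL -o /dev/null -w '%{http_code} %{url_effective}\n' 'http://www.postfix.org/'
```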