tldr-pages / tldr

📚 Collaborative cheatsheets for console commands
https://tldr.sh

Detecting dead "more information" links #5116

Closed: sbrl closed this issue 10 months ago

sbrl commented 3 years ago

I was inspired by #5111, and I developed another of my Bash one-liners to detect broken "more information" links in tldr pages. Broken links are defined as those returning an HTTP status code lower than 200 or greater than or equal to 400. Connection issues (e.g. DNS resolution failures, TLS certificate misconfigurations, etc.) are also counted as dead links.

Here's the one-liner:

find . -type f -iname '*.md' -print0 | xargs -0 cat | awk '/> More information/ { match($0, /<(.*)>/, arr); print(arr[1]); }' | sort | uniq | shuf | xargs -n1 -I{} bash -c 'url="{}"; code="$(curl --user-agent "curl; bash; xargs; tldr-pages-bad-url-checker (+https://github.com/tldr-pages/tldr; implemented by @sbrl)" -sSL -o /dev/null -w "%{http_code}" --fail -I "${url}")"; echo "${code} ${url}" >&2; if [[ "${code}" -lt 200 ]] || [[ "${code}" -ge 400 ]]; then echo "${url}"; fi' >/tmp/bad-urls.txt;

cd to the root of a previously-cloned copy of the tldr-pages repo, then run the above command. Replace /tmp/bad-urls.txt with the path of the file you'd like to populate with the dead links. Sadly it doesn't tell you which page each link is on, but that could be rectified with some refactoring; a rough sketch follows.
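
For what it's worth, here is one possible (untested) shape of that refactoring. Like the one-liner above it assumes GNU awk; it just checks each page individually so the page path can be printed next to the dead URL, and the output path is again only an example:

# Sketch: loop over pages one at a time so we know which page each dead URL came from.
find . -type f -iname '*.md' | while read -r page; do
    url="$(awk '/> More information/ { match($0, /<(.*)>/, arr); print arr[1]; }' "${page}")"
    [[ -z "${url}" ]] && continue
    code="$(curl -sSL -o /dev/null -w "%{http_code}" --fail -I "${url}")"
    if [[ "${code}" -lt 200 ]] || [[ "${code}" -ge 400 ]]; then
        echo "${page} ${url} ${code}"
    fi
done >/tmp/bad-urls-with-pages.txt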

Requirements: bash, curl, GNU find and xargs, GNU awk (for the three-argument match()), and the sort, uniq, and shuf utilities from coreutils.

Note that while xargs is used to (indirectly) drive curl here, I haven't added -P "$(nproc)", because otherwise you'll end up hitting GitHub's rate limit very quickly (a gentler variant is sketched below).
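
If you did want some parallelism anyway, one possible (untested) compromise is low concurrency plus a short pause per request. This assumes the extracted URLs have already been saved to a file such as /tmp/urls.txt, and omits the user agent for brevity:

# Hypothetical variant: 4 workers, one-second pause before each request.
xargs -P4 -I{} bash -c 'sleep 1; url="{}"; code="$(curl -sSL -o /dev/null -w "%{http_code}" --fail -I "${url}")"; if [[ "${code}" -lt 200 ]] || [[ "${code}" -ge 400 ]]; then echo "${url}"; fi' </tmp/urls.txt >/tmp/bad-urls.txt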

Anyway, here's the current list of dead links I've found with this:

https://harding.motd.ca/autossh
https://nixos.org/releases/nix/latest/manual/#sec-nix-collect-garbage
http://www.mysqlfanboy.com/mytop-3
https://nixos.org/releases/nix/latest/manual#sec-nix-build
https://taskwarrior.org/docs/timewarrior
https://docs.microsoft.com/windows-server/administration/windows-commands/reg-flags
https://clementsam.tech/glab/
https://flameshot.js.org
https://httpie.org
https://deluge-torrent.org
https://deemix.app
https://mailutils.org/manual/html_chapter/Programs.html#frm-and-from
https://docs.blender.org/manual/en/latest/render/workflows/command_line.html
https://www.infradead.org/openconnect/manual.html
https://www.clementine-player.org
https://crates.io/
https://jonls.dk/redshift
http://postfix.org
https://help.bitwarden.com/article/cli/
http://supervisord.org
https://redis.io
https://redis.io/topics/rediscli
https://sqlmap.org

It seems as though two of these are in fact not dead according to Firefox, but they appear to have HTTPS misconfigurations that cause curl to fail (i.e. they don't send the intermediate certificate along with the leaf certificate). Specifically:

https://mailutils.org/manual/html_chapter/Programs.html#frm-and-from
https://deluge-torrent.org/

...I've emailed the webmaster for mailutils.org about this, but I can't find a suitable contact address for deluge-torrent.org.
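
One illustrative way to check for a missing intermediate yourself (assuming openssl is installed; the host here is just an example) is to count how many certificates the server actually sends:

# A server that sends only one certificate is usually missing its intermediate.
openssl s_client -connect mailutils.org:443 -servername mailutils.org -showcerts </dev/null 2>/dev/null | grep -c 'BEGIN CERTIFICATE'

A count of 1 means only the leaf certificate was served; 2 or more means a chain is being sent.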

The other URLs are a mix of 404s, no-route-to-host errors, more serious HTTPS misconfigurations, and DNS resolution failures.
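
If you want to tell those failure modes apart, curl's exit status is enough. This is just an illustration (the URL is one from the list above); the exit codes are documented in man curl: 6 is a DNS failure, 7 a connection failure, 60 a certificate problem, and 22 an HTTP error when --fail is set.

# Print both the HTTP code and curl's exit status for a single URL.
url="https://deemix.app"
code="$(curl -sS -o /dev/null -w "%{http_code}" --fail -I -L "${url}")"
echo "HTTP code: ${code}, curl exit status: $?"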

We wouldn't want to run this as a GitHub Action, since random build failures would be annoying, but I thought the exercise was interesting and useful to run from time to time :-)

The final list (these should be the actually dead URLs):

https://harding.motd.ca/autossh
https://nixos.org/releases/nix/latest/manual/#sec-nix-collect-garbage
http://www.mysqlfanboy.com/mytop-3
https://nixos.org/releases/nix/latest/manual#sec-nix-build
https://taskwarrior.org/docs/timewarrior
https://docs.microsoft.com/windows-server/administration/windows-commands/reg-flags
https://clementsam.tech/glab/
https://flameshot.js.org
https://deluge-torrent.org
https://deemix.app
https://docs.blender.org/manual/en/latest/render/workflows/command_line.html
https://www.infradead.org/openconnect/manual.html
https://jonls.dk/redshift
http://postfix.org
https://sqlmap.org

In particular, http://postfix.org/ should be http://www.postfix.org/.

bl-ue commented 3 years ago

Really cool, @sbrl. I thought of this idea just today!

https://mailutils.org/manual/html_chapter/Programs.html#frm-and-from — @sahilister found this just 27 days ago!

bl-ue commented 3 years ago

A lot of those links are actually false positives, because you pass curl's -I flag, which sets the request method to HEAD, and that causes 404s/500s on sites like https://crates.io/ and https://redis.io.
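
For reference, a quick way to see whether a given site treats HEAD differently from GET (just an illustration, not part of the one-liner above):

# Compare the status code returned for a HEAD request vs a plain GET.
url="https://crates.io/"
echo "HEAD: $(curl -sSL -o /dev/null -w "%{http_code}" -I "${url}")"
echo "GET:  $(curl -sSL -o /dev/null -w "%{http_code}" "${url}")"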

sbrl commented 3 years ago

Oh really? That's very unusual. According to the HTTP spec HEAD should return exactly the same HTTP headers as a GET, just without the body IIRC.

We can remove links that actually work from the list.

sbrl commented 3 years ago

Updated my comment with a revised list

bl-ue commented 3 years ago

I see that. Looks good.

bl-ue commented 3 years ago

@sbrl, the description reads:

Broken links are defined as those returning an HTTP status code lower than 200 or greater than or equal to 400

But these have 3xx (redirect) status codes:

https://nixos.org/releases/nix/latest/manual/#sec-nix-collect-garbage
https://nixos.org/releases/nix/latest/manual#sec-nix-build
https://docs.microsoft.com/windows-server/administration/windows-commands/reg-flags
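
For those three, one illustrative way to see where the redirects actually end up (using curl's standard -w variables, with one of the URLs above as an example) is:

# Print the final URL and status code after following redirects.
url="https://docs.microsoft.com/windows-server/administration/windows-commands/reg-flags"
curl -sIL -o /dev/null -w "final URL: %{url_effective}\nstatus: %{http_code}\n" "${url}"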

schneiderl commented 3 years ago

That is very cool 👍

bl-ue commented 3 years ago

We should add this to tldr-lint 🤔

sbrl commented 3 years ago

But these have 3xx (redirect) status codes:

That's odd. I get an error when I load those links:

(screenshot of the error omitted)

That is very cool 👍

Thanks!

We should add this to tldr-lint 🤔

That'd be cool, but it would also be annoying when tldr-lint randomly fails because a link just went dead. I think it's best if it's run periodically.

bl-ue commented 3 years ago

That's odd. I get an error when I load those links:

So do I...I don't know what happened! 🤷

bl-ue commented 3 years ago

That'd be cool, but it would also be annoying when tldr-lint randomly fails because a link just went dead. I think it's best if it's run periodically.

Just for reference (because I just discovered this): https://github.com/tldr-pages/tldr/pull/2684#pullrequestreview-195338362

vladimyr commented 3 years ago

Here's the one-liner:

find . -type f -iname '*.md' -print0 | xargs -0 cat | awk '/> More information/ { match($0, /<(.*)>/, arr); print(arr[1]); }' | sort | uniq | shuf | xargs -n1 -I{} bash -c 'url="{}"; code="$(curl --user-agent "curl; bash; xargs; tldr-pages-bad-url-checker (+https://github.com/tldr-pages/tldr; implemented by @sbrl)" -sSL -o /dev/null -w "%{http_code}" --fail -I "${url}")"; echo "${code} ${url}" >&2; if [[ "${code}" -lt 200 ]] || [[ "${code}" -ge 400 ]]; then echo "${url}"; fi' >/tmp/bad-urls.txt;

@sbrl That's some lovely one-liner ❤️ I especially like your user agent 💯

vladimyr commented 3 years ago

@sbrl https://gist.github.com/vladimyr/d7bb314b14a59ca5fbefda3498fac81a 😉

Sneak peek (didn't have time to run it fully 🙃):

❯ tail -f bad-links.txt
common/autossh.md https://harding.motd.ca/autossh 0000
common/blender.md https://docs.blender.org/manual/en/latest/render/workflows/command_line.html 404
common/bw.md https://help.bitwarden.com/article/cli/ 500
common/cargo.md https://crates.io/ 404
common/clementine.md https://www.clementine-player.org 405
common/deemix.md https://deemix.app 0000
common/dexdump.md https://manpages.ubuntu.com/manpages/latest/en/man1/dexdump.1.html 404
common/glab.md https://clementsam.tech/glab/ 404
common/mutagen.md https://mutagen.io 405
common/mytop.md http://www.mysqlfanboy.com/mytop-3 0000
common/parquet-tools.md https://github.com/apache/parquet-mr/tree/master/parquet-tools 404
common/redis-cli.md https://redis.io/topics/rediscli 500
common/redis-server.md https://redis.io 500
common/redshift.md https://jonls.dk/redshift 0000

vladimyr commented 3 years ago

A lot of those links are actually false positives, because you pass curl's -I flag, which sets the request method to HEAD, and that causes 404s/500s on sites like crates.io and redis.io.

I don't know about Redis, but crates.io won't let you scrape their content. You'll get the same response even with a regular GET, because they maintain a UA allowlist on their end.
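
An illustrative way to test that (not from the gist; it just requests the same URL with a checker-style UA and a browser-style UA and compares the codes):

# Compare the status codes returned for two different user agents.
url="https://crates.io/"
for ua in "tldr-pages-bad-url-checker" "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"; do
    echo "$(curl -sSL -o /dev/null -w "%{http_code}" -A "${ua}" "${url}")  UA: ${ua}"
done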

sbrl commented 3 years ago

@vladimyr Nice one! Don't forget to sleep in between each call, though. Many of the links are GitHub, and you'll get rate-limited if you hit GitHub too fast. Alternatively, you could group the URLs by domain and check each group in parallel to speed things up, but to do that I'd probably write it in Node.js instead, with phin and async/await.

Edit: Also maybe a HEAD request would reduce remote load too?
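
For reference, a rough (untested) Bash sketch of the group-by-domain idea, assuming the URL list has already been saved to /tmp/urls.txt:

# Split URLs into one file per domain, then check the domains in parallel
# while staying sequential (and polite) within each domain.
mkdir -p /tmp/url-groups
awk -F/ '{ print > ("/tmp/url-groups/" $3) }' /tmp/urls.txt
for group in /tmp/url-groups/*; do
    (
        while read -r url; do
            echo "$(curl -sSL -o /dev/null -w "%{http_code}" --fail -I "${url}") ${url}"
            sleep 1
        done <"${group}"
    ) &
done
wait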

bl-ue commented 3 years ago

I don't know about Redis, but crates.io won't let you scrape their content. You'll get the same response even with a regular GET, because they maintain a UA allowlist on their end.

Yeah, @sbrl it would probably be better to use a standard UA like

Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36

Though yours is nice, it wouldn't work for some sites, as @vladimyr mentioned.

vladimyr commented 3 years ago

@vladimyr Nice one! Don't forget to sleep in between each call, though. Many of the links are GitHub, and you'll get rate-limited if you hit GitHub too fast.

I like living on the edge 😂

Alternatively, you could group the URLs by domain and check each group in parallel to speed things up, but to do that I'd probably write it in Node.js instead, with phin and async/await.

I was pretty sure I was aware of all the Node HTTP clients, but TIL about phin. Thank you! Yeah, there are many ways to make it faster; I just wanted to unroll it into a proper script as a starting point.

Edit: Also maybe a HEAD request would reduce remote load too?

Am I not doing that already? HEAD first, then fall back to GET on receiving a 405 (Method Not Allowed).
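
For anyone reading along, a minimal sketch of that fallback pattern (a simplified illustration, not the gist's actual code):

# Try HEAD first; if the server answers 405 (Method Not Allowed), retry with a regular GET.
check_url() {
    local url="$1" code
    code="$(curl -sSL -o /dev/null -w "%{http_code}" -I "${url}")"
    if [[ "${code}" -eq 405 ]]; then
        code="$(curl -sSL -o /dev/null -w "%{http_code}" "${url}")"
    fi
    echo "${code}"
}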

sbrl commented 10 months ago

Yeah, @sbrl it would probably be better to use a standard UA like

Good idea... but that feels a bit dishonest, declaring that you're something you're not? It doesn't quite sit right with me.

sbrl commented 10 months ago

Added the Bash one-liner to the wiki here: https://github.com/tldr-pages/tldr/wiki/Useful-scripts-and-programs#detect-broken-more-information-links

So I'm closing this issue for now, but we can continue the discussion here if needed.