oxsecurity / megalinter

🦙 MegaLinter analyzes 50 languages, 22 formats, 21 tooling formats, excessive copy-pastes, spelling mistakes and security issues in your repository sources with a GitHub Action, other CI tools or locally.
https://megalinter.io
GNU Affero General Public License v3.0

Add `User-Agent` in link checking #3304

Closed andrewvaughan closed 6 months ago

andrewvaughan commented 7 months ago

Is your feature request related to a problem? Please describe. Many websites block (403 response) requests without a User-Agent HTTP request header set. This causes link checkers to automatically fail.

For https://github.com/tcort/markdown-link-check/ a proper issue has already been raised (https://github.com/tcort/markdown-link-check/issues/172); however, given that, for one, this issue is now almost 3-years old without a response, and, for two, it's better for each individual client to provide their unique User-Agent to be a good netizen, I recommend having MegaLinter provide a versioned User-Agent in their default configurations.

Describe the solution you'd like Add the following to the default https://github.com/oxsecurity/megalinter/blob/main/TEMPLATES/.markdown-link-check.json configuration, but also to any other link-checking linters that may exist:

{
  // ... existing config ...

  "httpHeaders": [
    {
      "urls": ["http", ".", "/"],
      "headers": {
        "User-Agent": "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.8.0; +https://megalinter.io)"
      }
    }
  ]
}

For more information on User-Agent header best practices and why I recommend the above: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
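As a rough sketch of how the proposed block could be merged into an existing config without hand-editing, something like the following might work. `with_user_agent` and `URL_PREFIXES` are hypothetical names for illustration, not part of MegaLinter or markdown-link-check, and the `urls` prefixes are simply copied from the proposal above.

```python
import copy
import json

# URL prefixes copied verbatim from the proposal above; that they cover
# every link in a repository is an assumption, not something verified here.
URL_PREFIXES = ["http", ".", "/"]


def with_user_agent(config, user_agent):
    """Return a copy of a markdown-link-check config with a User-Agent entry appended."""
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    updated.setdefault("httpHeaders", []).append(
        {"urls": list(URL_PREFIXES), "headers": {"User-Agent": user_agent}}
    )
    return updated


example = with_user_agent(
    {},
    "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.8.0; +https://megalinter.io)",
)
print(json.dumps(example, indent=2))
```

Appending rather than overwriting keeps any `httpHeaders` entries a project already defines.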

Describe alternatives you've considered

Additional context

My MegaLinter right now:

Please?

andrewvaughan commented 7 months ago

So I still recommend this - because I was able to prove this was a blocking case using curl on my machine with and without a User-Agent - and confirmed that setting one resolved my issues when I ran markdown-link-check directly on my machine.

But, for some reason, running this with the given configuration in a MegaLinter run still fails, almost as if the config isn't being applied:

❌ Linted [MARKDOWN] files with [markdown-link-check]: Found 3 error(s) - (15.39s)
- Using [markdown-link-check v3.11.2] https://megalinter.io/7.7.0/descriptors/markdown_markdown_link_check
- MegaLinter key: [MARKDOWN_MARKDOWN_LINK_CHECK]
- Rules config: [/.config/linters/.markdown-link-check.json]
- Number of files analyzed: [11]
--Error detail:

  ERROR: 1 dead links found in .github/CONTRIBUTING.md !
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

  ERROR: 1 dead links found in .github/SUPPORT.md !
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

  ERROR: 1 dead links found in _TEMPLATE_CHECKLIST.md !
  [✖] https://stackoverflow.com/questions/32964920/should-i-commit-the-vscode-folder-to-source-control → Status: 403

I took megalinter out of the equation as much as possible and ran the markdown-link-check tool directly on the container to see if anything was different, and it still failed:

$ docker exec -it megalinter markdown-link-check -q -v -c /tmp/lint/.config/linters/.markdown-link-check.json /tmp/lint/.github/SUPPORT.md
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

  ERROR: 1 dead links found in /tmp/lint/.github/SUPPORT.md !
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

But running the exact same command in the project outside of the container is successful:

$ npx -y markdown-link-check -q -v -c .config/linters/.markdown-link-check.json .github/SUPPORT.md
(node:45774) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)

# (No errors)

I still believe this belongs in MegaLinter, though, because I cannot replicate the issue in any environment other than the MegaLinter Docker container.

And just for proactivity's sake, the versions match:

$ docker exec -it megalinter markdown-link-check --version
3.11.2

$ npx markdown-link-check --version
(node:43301) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
3.11.2

Tagging @tcort if they have any thoughts.

🤔

andrewvaughan commented 7 months ago

More progress - and I think this actually may need to be either moved or duplicated to https://github.com/tcort/markdown-link-check now, because of what I found.

I wanted to completely remove the idea that running in a container itself was the issue, so I followed the https://github.com/tcort/markdown-link-check directions on running markdown-link-check within Docker instead of via npx. I'll be honest, I fully expected this to work fine...

$ docker run -v ${PWD}:/tmp:ro --rm -i ghcr.io/tcort/markdown-link-check:stable -q -v -c /tmp/.config/linters/.markdown-link-check.json /tmp/.github/SUPPORT.md
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

  ERROR: 1 dead links found in /tmp/.github/SUPPORT.md !
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

...but it didn't.

It also fails in a much more simple environment:

$ docker run -it -v ${PWD}:/tmp:ro --rm node npx -y markdown-link-check -q -v -c /tmp/.config/linters/.markdown-link-check.json /tmp/.github/SUPPORT.md
(node:19) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

  ERROR: 1 dead links found in /tmp/.github/SUPPORT.md !
  [✖] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 → Status: 403

It seems that the User-Agent (or something else) is not taking effect, but only when run within a Docker container (or because of something specific about how both the MegaLinter image and these Docker images are built/configured).

Note - this problem happens whether I call it like I did above, or open a shell in the container and run markdown-link-check from the command line - I presented it here as a single line for ease of reproduction.

Interestingly, curl also fails in a similar manner, making me think this might be an underlying Docker configuration or utility issue:

# Works fine locally...
$ curl -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" -I "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"
HTTP/2 200 
date: Sun, 21 Jan 2024 21:44:16 GMT
# etc...

# Fails on the basic `node` image....
$ docker run -it --rm node curl -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" -I "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"
HTTP/2 403 
date: Sun, 21 Jan 2024 21:44:50 GMT
# etc...

# And even fails on the base `alpine` image...
$ docker run -it --rm alpine sh -c 'apk update -q; apk add -q curl; curl -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" -I "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"'
HTTP/2 403 
date: Sun, 21 Jan 2024 21:54:03 GMT
# etc....

I'm going to do some more digging to see if it's an issue with a commonality, Docker, or otherwise, but this is a deeper issue than I expected. Unfortunately, markdown-link-check has practically no debugging information outputted when configured, so getting more information on how the request was formatted (which I need to debug this) is going to be a challenge.

To be clear - don't close this issue. Adding the user-agent above is still a very, very good idea. This is indicative of a secondary problem.

andrewvaughan commented 7 months ago

More updates... it works fine with wget... well, at least on the base images. Just not these configurations. I think I may have pinpointed the source of the error (at least in the shell) to BusyBox.

# Local works fine...
$ wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378
--2024-01-21 16:55:55--  https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378
Resolving meta.stackexchange.com (meta.stackexchange.com)... 172.64.144.30, 104.18.43.226
Connecting to meta.stackexchange.com (meta.stackexchange.com)|172.64.144.30|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK

# Alpine fails out of the box
$ docker run -it --rm alpine sh -c 'wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'
Connecting to meta.stackexchange.com (172.64.144.30:443)
  HTTP/1.1 403 Forbidden
wget: server returned error: HTTP/1.1 403 Forbidden

# But Alpine works fine if we reinstall wget...
$ docker run -it --rm alpine sh -c 'apk update -q; apk add -q wget; wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'
--2024-01-21 22:03:56--  https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378
Resolving meta.stackexchange.com (meta.stackexchange.com)... 172.64.144.30, 104.18.43.226
Connecting to meta.stackexchange.com (meta.stackexchange.com)|172.64.144.30|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK

# Ubuntu works fine...
$ docker run -it --rm --entrypoint /bin/sh ubuntu -c 'apt update -qq; apt install -y -qq wget; wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'

# ...

Connecting to meta.stackexchange.com (meta.stackexchange.com)|104.18.43.226|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK

# Not Megalinter (which is alpine-based)
$ docker run --entrypoint /bin/bash -it --rm oxsecurity/megalinter-python:v7.7.0 -c 'wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'
Connecting to meta.stackexchange.com (104.18.43.226:443)
  HTTP/1.1 403 Forbidden
wget: server returned error: HTTP/1.1 403 Forbidden

# Nor markdown-link-check (which is also alpine-based)
$ docker run -it --rm --entrypoint /bin/sh ghcr.io/tcort/markdown-link-check:stable -c 'wget --server-response -U "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378'
Connecting to meta.stackexchange.com (172.64.144.30:443)
  HTTP/1.1 403 Forbidden
wget: server returned error: HTTP/1.1 403 Forbidden

So what about versions?

# Works fine
$ wget
GNU Wget 1.21.4 built on darwin22.4.0.

# Fails
$ docker run -it --rm alpine sh -c 'which wget; wget'
/usr/bin/wget
BusyBox v1.36.1 (2023-11-07 18:53:09 UTC) multi-call binary.

# Works fine
$ docker run -it --rm alpine sh -c 'apk update -q; apk add -q wget; which wget; wget --version'
/usr/bin/wget
GNU Wget 1.21.4 built on linux-musl.

# Works fine
$ docker run -it --rm --entrypoint /bin/sh ubuntu -c 'apt update -qq; apt install -y -qq wget; which wget; wget --version'
/usr/bin/wget
GNU Wget 1.21.2 built on linux-gnu.

# Fails
$ docker run --entrypoint /bin/bash -it --rm oxsecurity/megalinter-python:v7.7.0 -c 'which wget; wget'  
/usr/bin/wget
BusyBox v1.36.1 (2023-11-06 11:32:24 UTC) multi-call binary.

So that's interesting... it seems that the default BusyBox bundle (https://github.com/mirror/busybox) is the common point of failure in these images.

I wonder if this could be solved simply by adding a proper apk add wget to each of your Dockerfiles (of course, presuming this is what markdown-link-check is using in the background... my next deep dive down this rabbit hole).

nvuillam commented 7 months ago

What an investigation @andrewvaughan :D

it seems markdown-links-check

Maybe needle behaves differently depending on environment variables?

One thing to check could be to expose a mock service and log the calls made from inside and outside Docker to see the differences :)
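A minimal sketch of such a mock service, using only the Python standard library, might look like the following. `make_logging_server` is a hypothetical helper name, and because this listens on plain HTTP it can only reveal differences in the request line and headers; anything that differs at the TLS layer would not show up here.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_logging_server(on_request, host="127.0.0.1", port=0):
    """Build an HTTP server that passes each request's details to `on_request`.

    `port=0` asks the OS for any free port; read it back from
    `server.server_address[1]`.
    """

    class LoggingHandler(BaseHTTPRequestHandler):
        def _record(self):
            # Hand over everything a link checker controls in a plain request.
            on_request({
                "method": self.command,
                "path": self.path,
                "http_version": self.request_version,
                "headers": {name.lower(): value for name, value in self.headers.items()},
            })
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()

        do_GET = _record
        do_HEAD = _record

        def log_message(self, format, *args):
            # Silence the default stderr access log; `on_request` is the log.
            pass

    return HTTPServer((host, port), LoggingHandler)
```

Running `make_logging_server(print, host="0.0.0.0", port=8080).serve_forever()` on the host and pointing a test Markdown file at it from inside and outside the container would give two sets of logged requests to compare.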

andrewvaughan commented 7 months ago

Lol you should see the comment I was half-way through writing...

I have gone the depths of the dependency stack. I am weary and tired, but I bear the fruits of my labor:

Bear with me friends, because this is where my soul started tearing apart. The code was a jungle.

Which brought me to the Node.js core source-code with even LESS documentation...

Ergo, vis-à-vis

...that's about as far as I got

andrewvaughan commented 7 months ago

Narrowed it down:

$ docker run --entrypoint /bin/bash -it --rm oxsecurity/megalinter-python:v7.7.0

# curl --no-alpn -I -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"
HTTP/1.1 200 OK

# curl -I -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" "https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378"
HTTP/2 403

There must be something either about the HTTP/2 protocol or ALPN that is flagging the linter as an unwanted bot.

Although interesting, forcing HTTP/1.1 alone does not suffice:

# curl --http1.1 -I -A "Mozilla/5.0 (compatible; markdown-link-check/3.11.2; MegaLinter/7.7.0; +https://megalinter.io)" "http
HTTP/1.1 403 Forbidden

Now why these remote hosts are allowing non-ALPN traffic through but blocking ALPN traffic is an interesting question.
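For anyone wanting to repeat the comparison outside curl, here is a rough Python sketch of the same experiment. It assumes the standard library's HTTP/1.1-only client, so "with ALPN" here means offering only `http/1.1` (roughly the `curl --http1.1` case) and "without ALPN" is roughly `curl --no-alpn`; `make_tls_context` and `head_status` are hypothetical names, and no particular result is implied.

```python
import http.client
import ssl
from urllib.parse import urlsplit


def make_tls_context(offer_alpn):
    """Create a default TLS context, optionally advertising ALPN 'http/1.1'."""
    context = ssl.create_default_context()
    if offer_alpn:
        context.set_alpn_protocols(["http/1.1"])
    return context


def head_status(url, user_agent, offer_alpn, timeout=10):
    """Send one HEAD request; return (HTTP status, negotiated ALPN protocol or None)."""
    parts = urlsplit(url)  # the #fragment is dropped, as it is never sent to the server
    conn = http.client.HTTPSConnection(
        parts.hostname,
        parts.port or 443,
        timeout=timeout,
        context=make_tls_context(offer_alpn),
    )
    try:
        conn.connect()
        negotiated = conn.sock.selected_alpn_protocol()
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path, headers={"User-Agent": user_agent})
        return conn.getresponse().status, negotiated
    finally:
        conn.close()
```

Calling `head_status(url, ua, True)` and `head_status(url, ua, False)` from the same machine or container keeps everything but the ALPN extension constant between the two requests.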

Edit: I've answered this - every response came back with a cf-mitigated: response header. This means a Cloudflare WAF is in place and is putting that "checking to see if you're human" page in place. See https://github.com/tcort/link-check/issues/72 for more details.
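To illustrate how a link checker could tell this case apart from a genuinely dead link, here is a small sketch. The `cf-mitigated` header name comes from the responses described above; the function name and the three labels are hypothetical, and treating the mere presence of the header as a challenge is an assumption.

```python
def classify_link_result(status, headers):
    """Label a response as 'waf-challenge', 'alive', or 'dead'.

    `headers` is any mapping of header names to values; names are
    compared case-insensitively.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    if "cf-mitigated" in normalized:
        # Assumption: any cf-mitigated header means a WAF intervened,
        # so the status code says nothing about the link itself.
        return "waf-challenge"
    if 200 <= status < 400:
        return "alive"
    return "dead"


print(classify_link_result(403, {"CF-Mitigated": "challenge"}))
print(classify_link_result(403, {}))
```

With a distinction like this, a 403 from a WAF could be reported as "could not verify" rather than as a dead link.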

I will move the remainder of this discovery over to an issue on markdown-link-check, but I would definitely consider adding a unique User-Agent to MegaLinter with this issue - it will help prevent the default Needle/X.X User-Agent that gets applied from getting over-blocked.

echoix commented 7 months ago

Wow, what a pleasure to read the whole thinking process.

Near the beginning of the thread, I was thinking of trying a Debian/Debian slim/Ubuntu container too. To make sure that I'm not hitting some particular difference in musl-based packages, it's always good to check whether it works without them. In the last couple of years of still being subscribed to notifications on the node-red Docker repo, you'd be surprised by the frequency of weird behaviours that don't happen with a Debian-based base image (as they have both).

I'd never have thought of going as deep as you did; you even taught me a new word, ALPN! Your links to the Node.js source code, in a well-written debugging summary like this, would be good candidates for permalinks, so that the good stuff can be reread in a month's time (the issues you opened that reference this might take a while).

As for the user agent, I have three contradicting opinions. On one side, it is reasonable, and your explanations correctly justify the need for a user agent. On another side, shouldn't it be a user agent for the linter rather than for MegaLinter? While you are talking specifically about markdown-link-check, the linter I struggle with a bit more is lychee. That brings me to the third competing opinion: some sites answer completely differently depending on the user agent. Wink wink, SourceForge. Even though I had already found this out by myself before, it became apparent when working on a winget definition for a new software version, where the download URLs work only in specific cases. (Luckily they have an arrangement so their CI works better than locally.) But these differences came back at the beginning of the introduction of the lychee linter, before things got smoothed out. So here, sometimes having the generic, most common user agent is the only way to get a (badly) configured website to work at all.

So I can't decide yet which will weigh more in the balance.

andrewvaughan commented 7 months ago

Thanks for the kind words!

Per your concerns on the UA - you're 100% on point. That's why I particularly recommended the pattern of UA that I did. There's a link I put above with best-practices on generating UAs.

Most "crawlers" literally put "crawler/2.2.2" which can be problematic, if only because some lazy admins block "everything not standard," which was never intended for UAs. That's where marking it as compatible comes in.

The UA format is <name>/<version> <comment> - very, very simple. Filtering is only supposed to happen on the first two, but platform standard needs did add some filtering in the comment area informally.

As such, nearly all UAs for browsers are:

BrowserName/1.1 (System Information)

With some level of standardization in what the (System Information) entails.

However! That comment can technically be anything - and there actually is a better pattern for applications that meet the "requirements" of a browser standard but make use of it in a different way; for example:

Mozilla/5.0 (compatible; technology/x.x.x; technology/x.x.x; +https://reference)

This is a great pattern, because it both informs the server as to what standard can be managed and allows for fine-tune bot management by administrators. Maybe someone wants to block all of MegaLinter - maybe just link checkers. Maybe just particular, problematic versions. It's their choice in this format with some simple string-matching.

So you end up with something like the recommendation above, or, for something more simple, the following:

Mozilla/5.0 (compatible; MegaLinter/7.8.0; +https://megalinter.io)
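To make the pattern concrete, here is a small sketch that builds a UA in this shape and pulls the product tokens back out of the comment, the way an admin's string-matching rule might. Both function names are hypothetical, and the parser only understands this one pattern, not User-Agent strings in general.

```python
import re


def build_user_agent(products, reference_url):
    """Build 'Mozilla/5.0 (compatible; name/version; ...; +url)' from (name, version) pairs."""
    parts = ["compatible"] + [f"{name}/{version}" for name, version in products]
    parts.append(f"+{reference_url}")
    return f"Mozilla/5.0 ({'; '.join(parts)})"


def products_in_comment(user_agent):
    """Return {name: version} for each name/version token inside the parentheses."""
    match = re.search(r"\((.*)\)", user_agent)
    if not match:
        return {}
    found = {}
    for token in match.group(1).split(";"):
        token = token.strip()
        # Skip the 'compatible' marker and the '+https://...' reference URL.
        if "/" in token and not token.startswith("+"):
            name, version = token.split("/", 1)
            found[name] = version
    return found


ua = build_user_agent(
    [("markdown-link-check", "3.11.2"), ("MegaLinter", "7.8.0")],
    "https://megalinter.io",
)
print(ua)
print(products_in_comment(ua))
```

Because each technology is its own `name/version` token, a rule can target MegaLinter as a whole, one linter, or a single version.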

The reference URL at the end is also super helpful - coming from an admin, if I were to start seeing this new UA appear out of everywhere, my first reaction would be to block it. A responsible admin, however, will check the reference to see what its purpose is and determine as to whether it is nefarious or not for the purposes of the applications. Systems like Web Application Firewalls learn from this, and you might even start to see MegaLinter UAs have fewer and fewer issues with systems like Cloudflare WAF (which ended up being the primary cause of the problem, here - it turned out to not be the UA, at all... at least, not on its own).

Unfortunately, without any specification, you end up with the default for whatever the linters are - or sometimes no UA at all. For markdown-link-check, I believe it's using the default from the needle library it's using, which is needle/x.x.x.

Now, imagine how many people have probably used the needle dependency to make more ...problematic... bots for servers. How could they tell MegaLinter traffic from those bad bots? There's really no way. Easier to just block the entire batch, and you'll be much more likely to get picked up in a WAF as a nefarious bot.

So, for me - I think the question is whether the responsibility of setting an appropriate UA is for the tool or the tool container. I lean toward the argument that the UA should always represent the technology closest to the end user (in this case, MegaLinter, being the utility I chose to incorporate into my project, not necessarily the specific linter), so I would prefer my UA to represent Mozilla/5.0 (compatible; MegaLinter/7.8.0; +https://megalinter.io) - but this is a personal opinion. There's a strong argument to include the individual linter details in each UA, as well.

This is just me thinking out loud, but that's my $0.00002 on the issue!

Edit: I realized I didn't touch on a concern - there's always the default argument to just "copy/paste" a "known working" UserAgent to mimic a browser entirely... but WAFs caught on to that decades ago, and it's barely worthwhile these days. It has to do with usage patterns - raises AI eyebrows when "iOS Safari" only makes HEAD requests to human-readable endpoints and does about 30 distinct ones within 2 seconds.

That said... you can always offer a configurable override to end-users!
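One way such an override could work, sketched under the assumption that it is read from an environment variable: `LINK_CHECK_USER_AGENT` is a made-up variable name for illustration, not an existing MegaLinter setting, and the default is the simpler UA suggested above.

```python
import os

DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; MegaLinter/7.8.0; +https://megalinter.io)"
OVERRIDE_VARIABLE = "LINK_CHECK_USER_AGENT"  # hypothetical name, not a real MegaLinter variable


def resolve_user_agent(env=None, default=DEFAULT_USER_AGENT):
    """Return the user's override if set and non-blank, otherwise the default UA."""
    env = os.environ if env is None else env
    override = env.get(OVERRIDE_VARIABLE, "").strip()
    return override or default


print(resolve_user_agent({}))
print(resolve_user_agent({OVERRIDE_VARIABLE: "Mozilla/5.0 (compatible; MyOrgLinter/1.0; +https://example.org)"}))
```

Treating a blank value as "not set" avoids accidentally sending an empty User-Agent, which is the situation the issue set out to fix.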

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

If you think this issue should stay open, please remove the O: stale 🤖 label or comment on the issue.