tcort / markdown-link-check

checks all of the hyperlinks in a markdown text to determine if they are alive or dead

Links to sites that have DDoS protections are marked as failed #109

Open NicolasMassart opened 4 years ago

NicolasMassart commented 4 years ago

More and more sites have DDoS or anti-spam protections, like Cloudflare DDoS Attack Protection or GoDaddy, that make the link checker fail on these links (the server returns a 502 code, for instance, if you are not an actual browser).

The only way I see for the moment is to exclude these sites.

But in the future more and more sites will use these tools, and link checking will only be able to verify internal links. We have to figure out whether there is any technical solution to this issue and, if so, how to implement it.

All suggestions are welcome in the comments. Thanks.

NicolasMassart commented 4 years ago

It was suggested that I try using a real browser in these specific failure cases; https://github.com/puppeteer/puppeteer may help. To investigate.

NicolasMassart commented 4 years ago

Cloudflare, for instance, displays a captcha (click to confirm you are not a robot) to solve as a security check before the page is reached (example).

So even with a Puppeteer-controlled browser, it would not help solve it. It would be too easy ;)

And we can't implement all the security methods of all web providers. Here it's a Cloudflare one, but it will evolve, and others will create new ways to secure their sites; it's an endless race.

cadeef commented 4 years ago

It might be advantageous to extend MLC's enable/disable annotations to include outcomes that mark a link as success/fail. I could see an annotation along the lines of markdown-link-check-status: ddos as a means of describing an expected behavior; it would still permit all other statuses to be checked. This extends to abnormal status codes too: it would be nice to expect a link to return, say, a 403 when the checker isn't authenticated.
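
As an illustration only, here is a hypothetical sketch of such an annotation, modeled on MLC's existing disable/enable comments; the markdown-link-check-status syntax below is just a mock-up of the proposal, not an implemented feature:

    <!-- existing annotation: skip checking the next link entirely -->
    <!-- markdown-link-check-disable-next-line -->
    [Page behind a CAPTCHA](https://example.com/protected)

    <!-- hypothetical annotation from this proposal: check the link, but expect a 403 -->
    <!-- markdown-link-check-status-next-line: 403 -->
    [Intranet document](https://intranet.example.com/private)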

NicolasMassart commented 4 years ago

Is expecting a link to return a 403 a way to make the site check stronger? I'm not sure. The goal of MLC is to help keep links from being broken. If we hide the issue behind checking for an error code that could mean anything else, including a real problem, that may not help: it can be a 403 because of DDoS protection at first, but later really become an access-denied error, and we won't notice. I wonder if having MLC output a list of links to check manually at the end would be enough. Like an exclude comment, it would be a "manual check" comment on a link; it would let MLC list the links to check and would also document that in the Markdown source. It's pretty similar to what you suggest with your annotation, but it would not trigger any real link checking and would never make the tests fail. It's exclude, but the link still stays in the test list... Not sure what I wrote is very clear though 😄

cadeef commented 4 years ago

I’d say that providing a means to define a “good” link is in the spirit of MLC. There are occasions where it is necessary to redefine “good”, and extending the exclude framework to permit that makes the tool more flexible. I chose 403 specifically because it is a scenario in which the destination exists but MLC is unauthenticated. I actually have a use case for it personally on my family intranet site: I wish to verify that the remote document would succeed if authenticated, but without adding credentials to the source. There are potentially additional scenarios where defining an alternate status code might be beneficial to the user; HTTP status is a suggestion, not the law.
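
As a side note, a coarse-grained workaround along these lines may already be possible with MLC's aliveStatusCodes config option, which, if I read the README correctly, treats extra HTTP codes as alive, though globally rather than per-link as proposed here (a sketch, not a recommendation):

    {
        "aliveStatusCodes": [200, 206, 403]
    }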

NicolasMassart commented 4 years ago

You totally had my approval at "my family intranet site" 🤯 😉 Extending the exclusion system may be fine, but then we have to make sure users understand that this could generate false negatives (instead of the current false positives). The output in the log has to make clear that the link was not completely verified and that they should check it manually to make sure.

cadeef commented 3 years ago

It might be worthwhile investigating the headers that MLC sends with each request. I had a link reported "bad" (403) that definitely was not; it turned out it was denied by my mod_security setup on a generic rule for not sending an Accept header (log entry sanitized):

    [Thu Dec 17 19:48:52.040963 2020] [:error] [pid 11065:tid 139646180726528] [client 127.0.0.1:56384] [client 127.0.0.1] ModSecurity: Access denied with code 403 (phase 2). Operator EQ matched 0 at REQUEST_HEADERS. [file "/etc/httpd/modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf"] [line "47"] [id "960015"] [rev "1"] [msg "Request Missing an Accept Header"] [severity "NOTICE"] [ver "OWASP_CRS/2.2.9"] [maturity "9"] [accuracy "9"] [tag "OWASP_CRS/PROTOCOL_VIOLATION/MISSING_HEADER_ACCEPT"] [tag "WASCTC/WASC-21"] [tag "OWASP_TOP_10/A7"] [tag "PCI/6.5.10"] [hostname "example.com"] [uri "/do-not-track.txt"] [unique_id "X9wKhFxlGZj7qGQp7uBz5wAAAJU"]

NicolasMassart commented 3 years ago

@cadeef could you post what your modsec would have accepted?

cadeef commented 3 years ago

The issue was that the Accept header was missing from the request entirely. A standard Accept: */* from MLC would likely do the trick. MDN Header Reference
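
In the meantime, a user-side sketch of adding that header through MLC's httpHeaders config option could look like the following (the catch-all "https://" URL pattern is just an example):

    {
        "httpHeaders": [
            {
                "urls": ["https://"],
                "headers": {
                    "Accept": "*/*"
                }
            }
        ]
    }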

NicolasMassart commented 3 years ago

Should we modify link-check to add this as a default, or only document it here so users can add the header in their options if needed? (I tend to prefer adding the default in link-check.)

NicolasMassart commented 3 years ago

But anyway, it doesn't seem to be enough (that would be too easy) for Cloudflare DDoS protection. See for instance this page: https://metamask.zendesk.com/hc/en-us/articles/360015488991-Sending-Ether-New-UI- There's obviously no easy way...

We need to keep improving the fabric of the Web, as broken links are bad for the Web: they defeat its original purpose of a free, interconnected network where everyone is equally able to access data. That's less and less true every day, but I keep thinking we have to help find ways to prevent total failure.

Here are some thoughts about things we might do here (or in Link-check):

cadeef commented 3 years ago

I should have been clearer in my initial comment: I was doubtful the change would make a difference with Cloudflare. Projects of MLC's nature will continually be on the losing end of the request-inspection cat-and-mouse game.

Shoring up the request profile for MLC would likely eliminate some existing edge cases (like mine with standard mod_security rules), but it will always be a chase; WAFs evolve constantly by design.

I'd argue it's worthwhile, at least in the short term, to patch the low-hanging fruit in a way that maintains the standard user experience without additional annotation.

An exception framework with toggleable handling would be welcome, as I've alluded to before, but it creates additional complexity. Care in implementation would be necessary to ensure that the simplicity of the primary purpose (dead-simple, bolt-on markdown link checking) is maintained.

I like the idea of optionally punting to a headless third-party service in the event of failure, but it opens up a can of worms (availability, privacy, support) that would need to be carefully navigated.

NicolasMassart commented 3 years ago

I agree with your comments. DDoS protection on websites is something I have more and more issues with, though... Maybe it's specific to the domain I work in (blockchain), but I think it's, like spam, something we will have to live with... and it will break the basic fabric of the Web. Is it the role of MLC to fix this? Absolutely not, of course, but can we provide tools to help users deal with it? I think so. What tools? I'm not sure. Third-party API use could, for instance, be optional and disabled by default. It could be an add-on, or something that requires accepting the privacy impact and providing an API key anyway (and so accepting the API's terms of use).

I think MLC has to evolve anyway. If we want it to remain a small script, we will end up with something that is useless at some point. If we build proper software, with extension capabilities by design, we will be able to improve it and follow users' needs. I created a discussion about the roadmap: https://github.com/tcort/markdown-link-check/discussions/147 I can't force anyone to participate, but my opinion is that if we don't discuss the goals for MLC, we will only be able to deal with bug reports and some very use-case-specific improvements. We need to lead this as a real product and not be afraid to refactor. We can try to keep backward compatibility, but that's not a requirement for me as long as we provide at least the same features, or even better ones. If the config file format has to change for that, I don't think it's a blocker for many people.

sanmai-NL commented 2 years ago

Resolving #111 would help.

Kurt-von-Laven commented 2 years ago

As an example of what you all long ago predicted, all links to GitHub Docs (https://docs.github.com/*) started failing for me within the past 13 hours with 403s. I am running markdown-link-check 3.10.0, and the issue is not related to a change in the version of markdown-link-check, so I assume GitHub enhanced their DDoS protections. I am a fan of the proposal to support expecting a 403, but an even bigger fan of the idea of escalation to a headless browser (and, if warranted, additionally escalating from a headless browser to a full browser). I don't think anyone would argue that Puppeteer is a panacea for the inherent arms race markdown-link-check finds itself in, but it has far more resources to keep up with changing expectations. I may have missed some pertinent issues, but these were the only two hits for DDoS in their issue tracker:

As suggested in the latter, puppeteer-extra-plugin-stealth can be used to dodge most DDoS protections. As an overall strategy, consolidating around a single large project (or a small number of them) seems like a solid one for the open-source community, since we are much stronger together than apart.
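
For context, a minimal sketch of what escalating a failed check to a stealth headless browser could look like, assuming the puppeteer-extra and puppeteer-extra-plugin-stealth packages are installed; this is not part of markdown-link-check, and the URL is a placeholder:

    // fallback-check.js: re-check a link that failed the plain HTTP check
    const puppeteer = require('puppeteer-extra');
    const StealthPlugin = require('puppeteer-extra-plugin-stealth');

    // The stealth plugin patches many of the fingerprints that WAFs inspect.
    puppeteer.use(StealthPlugin());

    async function recheck(url) {
      const browser = await puppeteer.launch({ headless: true });
      try {
        const page = await browser.newPage();
        const response = await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
        return response && response.status() < 400; // treat 2xx/3xx as alive
      } finally {
        await browser.close();
      }
    }

    recheck('https://example.com/').then((alive) => console.log(alive ? 'alive' : 'dead'));

Whether the extra dependency weight and runtime cost are acceptable for MLC is exactly the trade-off discussed earlier in this thread.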

Kurt-von-Laven commented 2 years ago

I was wrong. The issue with links to GitHub Docs had nothing to do with DDoS protection; rather, GitHub evidently started requiring that clients accept compression. This .markdown-link-check.json resolved the issue for us.
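
The linked config isn't reproduced in this thread; purely as an illustration of the kind of fix described (the URL pattern and header value below are assumptions, not the original file), sending an Accept-Encoding header through httpHeaders would look like:

    {
        "httpHeaders": [
            {
                "urls": ["https://docs.github.com/"],
                "headers": {
                    "Accept-Encoding": "gzip, deflate, br"
                }
            }
        ]
    }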

bobrik commented 1 month ago

I hit this with an origin backed by Netlify. They return a 403 when the request carries the User-Agent: link-check/5.3.0 header:

    $ curl -s -H "user-agent: link-check/5.3.0" -I https://ebpf.io/what-is-ebpf/ | head -n1
    HTTP/2 403

The following in the config fixed it:

    "httpHeaders": [
        {
            "urls": [
                "https://"
            ],
            "headers": {
                "user-agent": "pls let me in"
            }
        }
    ],
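
For completeness, a config file like this is passed to markdown-link-check with the -c/--config flag (a usage sketch; adjust the file name and target to your setup):

    $ markdown-link-check -c .markdown-link-check.json README.md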