raviqqe / muffet

Fast website link checker in Go
MIT License

429 errors when using `--max-connections` #172

Closed by PatrickHeneise 2 years ago

PatrickHeneise commented 3 years ago

We have a bunch of GitHub issue links on a site, and even with --max-connections=10 --buffer-size=8192 --color=always --rate-limit=2 we're running into a lot of 429 errors. Any suggestions on how to avoid this?

raviqqe commented 3 years ago

It might be related to redirections where we currently don't apply rate limits. Can you send an example URL of the GitHub issue links?

PatrickHeneise commented 3 years ago

Thanks for having a look at this. They're mostly direct links to GitHub issues:

    429 https://github.com/cortexproject/cortex/pull/3897
    429 https://github.com/cortexproject/cortex/pull/3904
    429 https://github.com/cortexproject/cortex/pull/3905
    429 https://github.com/cortexproject/cortex/releases/tag/v1.8.0-rc.1
    429 https://github.com/grafana/loki/releases/tag/v2.2.0
    429 https://github.com/opstrace/opstrace/issues/412
    429 https://github.com/opstrace/opstrace/issues/442
    429 https://github.com/opstrace/opstrace/issues/465
    429 https://github.com/opstrace/opstrace/issues/483
    429 https://github.com/opstrace/opstrace/pull/397
    429 https://github.com/opstrace/opstrace/pull/413
    429 https://github.com/opstrace/opstrace/pull/441
    429 https://github.com/opstrace/opstrace/pull/453
    429 https://github.com/opstrace/opstrace/pull/472
    429 https://github.com/opstrace/opstrace/pull/482
    429 https://github.com/opstrace/opstrace/pull/487

PatrickHeneise commented 3 years ago

I'm trying --max-connections-per-host=2, but then I get a lot of "no free connections available to host" errors. The most reliable config I've found so far is:

    muffet https://... --rate-limit=5 --skip-tls-verification --buffer-size=8192 --exclude="gstatic.com|linkedin.com...

but even then I get 429s from all the GitHub links we have in our docs and articles.

Another option would be to ignore certain HTTP statuses: instead of raising an error on 429, maybe just emit a warning and let it pass?
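
To make that suggestion concrete, here is a minimal Go sketch of downgrading selected status codes to warnings. It is not a muffet feature or its internal API; `Severity`, `classify`, and the `tolerated` set are hypothetical names introduced only for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// Severity is a hypothetical result category for a checked link.
type Severity int

const (
	OK Severity = iota
	Warning
	Failure
)

// classify maps an HTTP status code to a severity. Codes present in
// tolerated (for example 429) are reported as warnings instead of failures.
func classify(status int, tolerated map[int]bool) Severity {
	switch {
	case status >= 200 && status < 300:
		return OK
	case tolerated[status]:
		return Warning
	default:
		return Failure
	}
}

func main() {
	// Assumption: only 429 Too Many Requests is tolerated.
	tolerated := map[int]bool{http.StatusTooManyRequests: true}
	for _, s := range []int{200, 404, 429} {
		fmt.Println(s, classify(s, tolerated))
	}
}
```

With this shape, a checker could still print the 429 links while deciding its exit code from `Failure` results only.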

PatrickHeneise commented 3 years ago

@raviqqe did you get a chance to look into this?

raviqqe commented 3 years ago

The --max-connections-per-host option should work better on the main branch (not the latest release, v2.4.5). Can you test it?

PatrickHeneise commented 3 years ago

I'm not too familiar with Go. How do I install from the main branch?

raviqqe commented 3 years ago

Like this:

    git clone https://github.com/raviqqe/muffet
    cd muffet
    GO111MODULE=on go build
    ./muffet https://foo.com

PatrickHeneise commented 3 years ago

Same result: I get either 429s or "no free connections available to host" errors (tried with --max-connections-per-host 2 and 5).

raviqqe commented 2 years ago

Can you give me an example URL to test? I can't reproduce the problem with my websites with GitHub URLs.

PatrickHeneise commented 2 years ago

    muffet https://opstrace.com --max-connections-per-host=10 --rate-limit=5 --exclude="gstatic.com|linkedin.com|googletagmanager.com"

We're using the GitHub Action to check for broken links on a regular basis. Some pages have lots of GitHub links, and that's when GitHub starts returning 429s.

raviqqe commented 2 years ago

I don't see any 429 errors from GitHub anymore with --rate-limit 1. Can you try it in your environment? --rate-limit 5 still seems to be too high for links at github.com.

> go run . --ignore-fragments --max-connections-per-host 10 --rate-limit 1 --exclude "gstatic.com|linkedin.com|googletagmanager.com" https://opstrace.com 
https://opstrace.com/cdn-cgi/l/email-protection
        403     https://support.cloudflare.com/hc/en-us/articles/200170016-What-is-Email-Address-Obfuscation-
        403     https://www.cloudflare.com/sign-up?utm_source=email_protection
https://opstrace.com/terms-of-service
        404     https://go.opstrace.com/dpa
        429     https://stripe.com/legal
https://opstrace.com/blog/collecting-metrics-from-cockroachdb
        dial tcp4 127.0.0.1:8080: connect: connection refused   http://localhost:8080/#/metrics/overview/cluster
https://opstrace.com/docs/quickstart
        lookup $opstrace_name.opstrace.io: no such host https://$OPSTRACE_NAME.opstrace.io/login
        lookup prod.$opstrace_name.opstrace.io: no such host    https://prod.$OPSTRACE_NAME.opstrace.io/grafana/explore?orgId=1&left=%5B%22now-30m%22,%22now%22,%22metrics%22,%7B%7D%5D
        lookup staging.$opstrace_name.opstrace.io: no such host https://staging.$OPSTRACE_NAME.opstrace.io/grafana/explore?orgId=1&left=%5B%22now-30m%22,%22now%22,%22metrics%22,%7B%7D%5D
https://opstrace.com/blog/introducing-the-open-source-distribution
        404     https://kinvolk.io/flatcar-container-linux
https://opstrace.com/docs/guides/contributor/writing-docs
        400     https://www.grammarly.com
https://opstrace.com/media
        404     https://dok.community/dokc-day-schedule/
https://opstrace.com/privacy-gdpr-supplement
        404     https://go.opstrace.com/dpa
https://opstrace.com/blog/nextjs-on-cloudflare
        403     https://developers.cloudflare.com/images/resizing-with-workers
        403     https://developers.cloudflare.com/pages/platform/github-integration
        403     https://developers.cloudflare.com/pages/platform/known-issues
        429     https://stripe.com/
        503     https://blog.cloudflare.com/cloudflare-pages-ga/
https://opstrace.com/docs/guides/user/configuring-alerts
        404     https://opstrace.com/docs/guides/user/#configure-a-contact-point
        404     https://opstrace.com/docs/guides/user/#configure-a-notification-policy
        404     https://opstrace.com/docs/guides/user/#configure-an-alerting-rule
        404     https://opstrace.com/docs/guides/user/#using-the-http-api-to-configure-alerts
https://opstrace.com/blog/week-12-update
        404     https://github.com/opstrace/opstrace/tree/main/test/test-remote/containers/looker
exit status 1

raviqqe commented 2 years ago

Ah, actually, I found a bug where muffet doesn't handle cross-origin redirects properly. I'm gonna fix that and come back.

raviqqe commented 2 years ago

This should be fixed on the main branch now. Let me know if you still see similar errors.

Note that you might still occasionally see "no free connections available to host" errors because Muffet doesn't have full control over those connections. If you run into them often, please open another issue; we should at least be able to find a workaround.

After the fix, I can consistently run Muffet against https://opstrace.com with the following options. The User-Agent header seems to be required for stripe.com pages, which return 429 errors when it's not set.


> go run . --ignore-fragments --max-connections-per-host 10 --rate-limit 1 --exclude "gstatic.com|linkedin.com|googletagmanager.com" --header 'User-Agent: muffet' --buffer-size 10000 https://opstrace.com 
https://opstrace.com/docs/quickstart
        lookup $opstrace_name.opstrace.io: no such host https://$OPSTRACE_NAME.opstrace.io/login
        lookup prod.$opstrace_name.opstrace.io: no such host    https://prod.$OPSTRACE_NAME.opstrace.io/grafana/explore?orgId=1&left=%5B%22now-30m%22,%22now%22,%22metrics%22,%7B%7D%5D
        lookup staging.$opstrace_name.opstrace.io: no such host https://staging.$OPSTRACE_NAME.opstrace.io/grafana/explore?orgId=1&left=%5B%22now-30m%22,%22now%22,%22metrics%22,%7B%7D%5D
https://opstrace.com/docs/references/configuration
        lookup aws.amazon.com: no such host     https://aws.amazon.com/blogs/containers/introducing-the-new-amazon-eks-console
https://opstrace.com/blog/collecting-metrics-from-cockroachdb
        dial tcp4 127.0.0.1:8080: connect: connection refused   http://localhost:8080/#/metrics/overview/cluster
https://opstrace.com/blog/introducing-the-open-source-distribution
        404     https://kinvolk.io/flatcar-container-linux
https://opstrace.com/media
        404     https://dok.community/dokc-day-schedule/
https://opstrace.com/terms-of-service
        404 (following redirect https://opstrace.com/Opstrace%20-%20Data%20Processing%20Addendum%203.30.2021.pdf)       https://go.opstrace.com/dpa
https://opstrace.com/blog/nextjs-on-cloudflare
        503     https://developers.cloudflare.com/images/resizing-with-workers
        503     https://developers.cloudflare.com/pages/platform/github-integration
        503     https://developers.cloudflare.com/pages/platform/known-issues
https://opstrace.com/cdn-cgi/l/email-protection
        403     https://support.cloudflare.com/hc/en-us/articles/200170016-What-is-Email-Address-Obfuscation-
        403     https://support.cloudflare.com/hc/en-us/categories/200275218-Getting-Started
        403 (following redirect https://dash.cloudflare.com/sign-up?utm_source=email_protection)        https://www.cloudflare.com/sign-up?utm_source=email_protection
https://opstrace.com/docs/guides/user/configuring-alerts
        404     https://opstrace.com/docs/guides/user/#configure-a-contact-point
        404     https://opstrace.com/docs/guides/user/#configure-a-notification-policy
        404     https://opstrace.com/docs/guides/user/#configure-an-alerting-rule
        404     https://opstrace.com/docs/guides/user/#using-the-http-api-to-configure-alerts
https://opstrace.com/privacy-gdpr-supplement
        404 (following redirect https://opstrace.com/Opstrace%20-%20Data%20Processing%20Addendum%203.30.2021.pdf)       https://go.opstrace.com/dpa
https://opstrace.com/blog/week-12-update
        404     https://github.com/opstrace/opstrace/tree/main/test/test-remote/containers/looker
exit status 1
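
As an illustration of what a "Name: value" header option like the one above amounts to on the wire, here is a small Go sketch with the standard library. It does not reflect muffet's implementation; `parseHeader` and `newRequest` are hypothetical helpers, and the URL is just the stripe.com link from the earlier output.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// parseHeader splits a "Name: value" string into its name and value.
func parseHeader(s string) (string, string, error) {
	parts := strings.SplitN(s, ":", 2)
	if len(parts) != 2 || strings.TrimSpace(parts[0]) == "" {
		return "", "", fmt.Errorf("invalid header %q", s)
	}
	return strings.TrimSpace(parts[0]), strings.TrimSpace(parts[1]), nil
}

// newRequest builds a GET request carrying the given custom headers.
func newRequest(url string, headers []string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	for _, h := range headers {
		name, value, err := parseHeader(h)
		if err != nil {
			return nil, err
		}
		req.Header.Set(name, value)
	}
	return req, nil
}

func main() {
	req, err := newRequest("https://stripe.com/legal", []string{"User-Agent: muffet"})
	if err != nil {
		panic(err)
	}
	// The request is only constructed here, not sent.
	fmt.Println(req.Header.Get("User-Agent"))
}
```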

PatrickHeneise commented 2 years ago

Awesome, thanks a lot!