raviqqe / muffet

Fast website link checker in Go

Add support for warn versus error or ignore #292

spkane opened this issue 1 year ago (status: open)

spkane commented 1 year ago

It would be nice to be able to categorize HTTP error codes and URL patterns that should be reported but should not trigger an error.

That way you can report on less critical errors that you still might want to fix, or at least be aware of.

At the moment, because a URL can only be ignored or cause a failure, you may be forced to ignore a pattern, which also makes you blind to any real failures that crop up later at that URL.
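
To illustrate the current binary behavior (--exclude is the real flag; the host is just an example):

# Today a flagged link can only fail the run or be hidden entirely:
muffet https://example.com                           # a 403 from a flaky host fails CI
muffet --exclude 'www.unix.com' https://example.com  # the 403, and any future 404, is hidden

A third "warn" severity would sit between these two extremes.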

spkane commented 1 year ago

Ideally, this would build on the feature in https://github.com/raviqqe/muffet/issues/291, but it could be done initially with just the pattern arguments.

spkane commented 1 year ago

To build on this a bit, it would be really flexible if I could set this globally or per pattern. For example, for www.unix.com I could ignore (or only warn on) the 403 it always responds with, but still error on a 404 for that particular site. LinkedIn reports 999 on public profiles, for whatever reason, so that is another useful example. Something like:

--exclude {name=www.unix.com/man-page/linux, ignore=403, warn=308} --exclude {name=linkedin.com/in/, ignore=999}

For reference, here is muffet's current output for one of those links:

*** INFO: [2023-03-09 17:48:22] Start checking: "https://example.com"
https://example.com/journal/unix-programming/
    403 (following redirect https://www.unix.com/man-page/linux/5/init/)    http://www.unix.com/man-page/linux/5/init/
*** ERROR: [2023-03-09 17:48:47] Something went wrong - see the errors above...
raviqqe commented 1 year ago

What kinds of status codes do you want to mark as warnings? For example, is reducing 308s a matter of SEO?

spkane commented 1 year ago

As an example, one might want to know about a redirect so that it can eventually be fixed, without it throwing an error and therefore breaking the deployment of a website change.
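
In the meantime, the closest approximation seems to be two passes in CI, sketched below ($SITE and the excluded hosts are placeholders; --exclude is the real flag):

# Pass 1: the gate. Exclude the hosts whose failures should only warn,
# so that only the remaining "real" errors can break the deployment.
muffet --exclude 'www.unix.com' --exclude 'linkedin.com' "$SITE" || exit 1

# Pass 2: warn only. Check everything, print any failures, and let
# "|| true" swallow the non-zero exit so the build still passes.
muffet "$SITE" || true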

raviqqe commented 1 year ago

What is the size of your website? For example, how many pages and links does it have roughly?

spkane commented 1 year ago

It is not huge, but we do have a lot of long technical blog articles that tend to link out to other sites, whose links and general behavior are more likely to change or become invalid over time.

spkane commented 1 year ago

I could see value in being able to pass this information in via a config file when there are a lot of rules, in addition to simply supplying a few options on the command line when the rules are very simple.
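
For illustration only, such a config might look like the following sketch (every key here is invented; muffet does not read any such file today):

# hypothetical rules file -- nothing below exists in muffet yet
rules:
  - pattern: www.unix.com/man-page/linux
    ignore: [403]   # never report
    warn: [308]     # report, but do not fail the run
  - pattern: linkedin.com/in/
    ignore: [999]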

Sieboldianus commented 1 year ago

I want to bump this issue/idea. I have a Hugo site with about 4500 links that I check via GitLab CI. Basically, every time I add a new blog post, the CI tests break and I need to update my exclude list. Currently, the script looks like the one below, with the ... standing in for many more --exclude lines.

#!/bin/bash

LOCAL_HOST="http://localhost:1313/links/"
MAX_WAIT_TIME=60 # 60 iterations x 0.5 s sleep = 30 s total
OPTIONS="--exclude 'reddit.com' \
         --exclude 'anaconda.org' \
         --exclude 'arxiv.org' \
         --exclude 'docker.com' \
         --exclude 'stackoverflow.com' \
         --exclude 'linuxize.com' \
         --exclude 'cyberciti.biz' \
         --exclude 'gitlab.yourgitlab.com' \
         --exclude 'openai.com' \
         --exclude '\.webm$' \
         ...
         --ignore-fragments \
         --max-response-body-size 100000000 \
         --junit > rspec.xml"

for i in $(seq 0 ${MAX_WAIT_TIME}); do # poll every 0.5 s until the server answers
    sleep 0.5
    IS_SERVER_RUNNING=$(curl -LI ${LOCAL_HOST} -o /dev/null -w '%{http_code}' -s)
    if [[ "${IS_SERVER_RUNNING}" == "200" ]]; then
        # eval re-parses ${OPTIONS} so each --exclude becomes its own word
        eval muffet "${OPTIONS}" ${LOCAL_HOST} && exit 0 || exit 1
    fi
done

echo "error: time out $((${MAX_WAIT_TIME}/2)) sec" && exit 1