raviqqe / muffet

Fast website link checker in Go
MIT License

`--exclude` patterns are not matched against full URL #211

Status: Closed (cipriancraciun closed 2 years ago)

cipriancraciun commented 2 years ago

I'm trying to validate a site served via Cloudflare. Cloudflare inserts URLs of the form https://example.com/cdn-cgi/l/email-protection#5c333a3... (where example.com is my domain).

However, if I pass --exclude '^https://example\.com/cdn-cgi/.*$', it does not seem to work: muffet still complains about missing URLs, although those should match the given regex.

This seems to be because Cloudflare actually inserts just /cdn-cgi/... in the HTML, and muffet matches --exclude against the "raw" URL rather than the "resolved" URL (which would include the document base).

It would be nice if --exclude also tried to match the regex against the "resolved" URL, in addition to the "raw" one.


On the same topic, does muffet perform any normalization on the URL before matching? (For example, at least percent-encoding/decoding sanitization.)

lpar commented 2 years ago

Include patterns aren't matched against the full URL either. This is problematic, because a common use case is to validate all the links within a web site, and not care too much if other web sites have broken your links to them.

e.g. muffet -i www.example.com https://www.example.com

I think this is a bug, because the help says:

  -e, --exclude=<pattern>...                Exclude URLs matched with given regular expressions
  -i, --include=<pattern>...                Include URLs matched with given regular expressions

With the current behavior, URLs which would match the include pattern get excluded because their path doesn't match the include pattern.

About to open a PR which I think fixes it.