raviqqe / muffet

Fast website link checker in Go
MIT License
2.46k stars 95 forks

links to docs fragments failing after github.com refresh moved content to user scripts #356

Closed qrkourier closed 5 months ago

qrkourier commented 5 months ago

The github.com UI now has tabs associated with URL query params like ?tab=readme-ov-file. Links like https://github.com/openziti/edge-api?tab=readme-ov-file#user-content-versioning now fail because github.com moved the content from the HTML to a userscript, so muffet can't "see" the target fragment in the HTML and reports the link as broken.

I'll certainly have to stop checking GitHub fragments for now. I'm unsure how to check such links moving forward.

raviqqe commented 5 months ago

Do you think this is a duplicate of #144 or #254?

qrkourier commented 5 months ago

Thank you for helping me find those closed issues.

In summary, it's expected that muffet currently cannot check links whose target appears only in content rendered by client-side JavaScript. It can only check targets that appear in the HTML itself.
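To illustrate why such a link looks broken, here is a rough sketch of a static fragment check. It is not muffet's actual implementation: `hasAnchor` is a hypothetical helper, and a regular expression stands in for a real HTML parser. The point is only that a checker reading the raw HTML never sees an `id` that a script would add later.

```go
package main

import (
	"fmt"
	"regexp"
)

// hasAnchor reports whether the raw HTML contains an element whose id or
// name attribute equals fragment. Hypothetical helper for illustration;
// a real checker would use an HTML parser instead of a regexp.
func hasAnchor(html, fragment string) bool {
	pattern := `\s(?:id|name)\s*=\s*["']` + regexp.QuoteMeta(fragment) + `["']`
	return regexp.MustCompile(pattern).MatchString(html)
}

func main() {
	// The heading is present in the HTML the server sends.
	static := `<h2 id="intro">Intro</h2>`
	// The heading would only exist after the script runs in a browser.
	clientRendered := `<div id="app"></div><script>render("intro")</script>`

	fmt.Println(hasAnchor(static, "intro"))
	fmt.Println(hasAnchor(clientRendered, "intro"))
}
```

The second page does mention "intro", but only inside a script, so a check against the static HTML finds no matching target.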

One solution is to tell muffet to stop checking URL "fragments", a.k.a. "anchors", e.g., the #intro part of the URL https://www.example.com/welcome#intro. Setting muffet --ignore-fragments results in checking that the fragment's parent page exists, but not, for example, that the particular heading's "id" attribute is valid (<a href id="intro">). This applies to all the crawled links.
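As a sketch of what ignoring a fragment amounts to, the following drops the fragment with Go's standard net/url package and leaves the parent page URL. `stripFragment` is a hypothetical helper, not a function from muffet.

```go
package main

import (
	"fmt"
	"net/url"
)

// stripFragment returns rawURL without its "#..." part, i.e. the parent
// page that would still be checked when fragments are ignored.
// Hypothetical helper for illustration only.
func stripFragment(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	u.Fragment = ""
	u.RawFragment = ""
	return u.String(), nil
}

func main() {
	page, err := stripFragment("https://www.example.com/welcome#intro")
	if err != nil {
		panic(err)
	}
	fmt.Println(page) // https://www.example.com/welcome
}
```

Query parameters such as ?tab=readme-ov-file are not part of the fragment, so they stay in the URL.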

Another solution is to set an exclude pattern that causes muffet to ignore the entire URL when it belongs to a particular domain and contains a fragment, e.g., --exclude='https?://github\.com/.*#'. With this pattern, fragments are still checked for all other sites, and github.com links are not checked at all if they contain the fragment prefix #.
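The pattern https?://github\.com/.*# can be tried against a few sample URLs before relying on it. This sketch assumes the pattern is applied as an unanchored Go regular expression. `isExcluded` is a hypothetical helper, and the sample URLs are just the ones from this thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// excludeGitHubFragments matches github.com URLs that contain a "#".
var excludeGitHubFragments = regexp.MustCompile(`https?://github\.com/.*#`)

// isExcluded reports whether rawURL matches the exclude pattern.
// Hypothetical helper for illustration only.
func isExcluded(rawURL string) bool {
	return excludeGitHubFragments.MatchString(rawURL)
}

func main() {
	samples := []string{
		"https://github.com/openziti/edge-api?tab=readme-ov-file#user-content-versioning",
		"https://github.com/openziti/edge-api",
		"https://www.example.com/welcome#intro",
	}
	for _, s := range samples {
		fmt.Printf("excluded=%t %s\n", isExcluded(s), s)
	}
}
```

Only the first URL matches: the second has no fragment and the third is on another domain, so both would still be checked.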