raviqqe / muffet

Fast website link checker in Go
MIT License

--follow-sitemap doesn't follow the sitemap #316

Closed: zachad closed this issue 1 year ago

zachad commented 1 year ago

Originally added after #32, the --follow-sitemap-xml option should read the sitemap.xml from a site to determine which pages to check.

This doesn't appear to work as expected.

Consider a very simple site with three main pages: /index.html, /page1.html and /secret.html. All three are listed in /sitemap.xml, and index.html links to page1.html; however, none of the pages links directly to secret.html.

When muffet is invoked with --follow-sitemap-xml I expect it to read the sitemap and visit each page that is listed there, including secret.html. What it seems to do instead is start with /index.html and scan linked pages recursively only if they're also in the sitemap.

Is this the intended behavior of this option? Is there a way to achieve our use case?

raviqqe commented 1 year ago

I don't remember exactly, but I think that at the time I considered adding all pages in the sitemap up front not useful, because they would be reachable from somewhere else anyway (and in your case, they are not...)

Currently, Muffet treats sitemaps as filters, much like robots.txt files: sitemap.xml files are considered allowlists of pages that may be checked, while robots.txt files are blocklists. Neither adds new starting points to the crawl.
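To make the distinction concrete, here is a minimal sketch of the two interpretations, using the three-page site from the report. It is not Muffet's actual implementation: the `crawl` function, the in-memory `links` graph, and the `allow` map are all hypothetical names invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// crawl walks a hypothetical in-memory link graph breadth-first from seeds.
// A page is visited only if allow is nil or contains it (allowlist filter).
func crawl(links map[string][]string, seeds []string, allow map[string]bool) []string {
	visited := map[string]bool{}
	queue := append([]string{}, seeds...)

	for len(queue) > 0 {
		page := queue[0]
		queue = queue[1:]

		if visited[page] || (allow != nil && !allow[page]) {
			continue
		}

		visited[page] = true
		queue = append(queue, links[page]...)
	}

	pages := make([]string, 0, len(visited))
	for page := range visited {
		pages = append(pages, page)
	}
	sort.Strings(pages)

	return pages
}

func main() {
	// The example site: index.html links to page1.html, nothing links to secret.html.
	links := map[string][]string{
		"/index.html":  {"/page1.html"},
		"/page1.html":  {},
		"/secret.html": {},
	}
	sitemap := []string{"/index.html", "/page1.html", "/secret.html"}

	allow := map[string]bool{}
	for _, page := range sitemap {
		allow[page] = true
	}

	// Sitemap used only as an allowlist: secret.html is never discovered.
	fmt.Println(crawl(links, []string{"/index.html"}, allow))
	// prints [/index.html /page1.html]

	// Sitemap entries used as additional starting points: secret.html is checked.
	fmt.Println(crawl(links, sitemap, allow))
	// prints [/index.html /page1.html /secret.html]
}
```

With the allowlist-only interpretation, a page that is listed in the sitemap but not linked from anywhere is permitted yet never reached, which matches the behavior described in the report.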

You might be able to use the --single-page option for your use case, but I'm not sure. That would also duplicate link checks.

Is secret.html another entry point to your site, like an auth page that is reachable only through specific actions?

nwidger commented 1 year ago

Thanks @raviqqe for the quick response, I work with @zachad so maybe I can add some extra context here.

We have a public, Hugo-generated support website for our customers. It contains so many support articles that presenting users with a giant index of every single page would be overwhelming. Because of that, many pages on the site are not linked from anywhere internally. Instead, we pass a sitemap.xml file generated by Hugo (which contains every page on the site) to search engines such as Google and tell them to use it to index the site. We do something similar with an Algolia-powered search bar available within the support site itself. No page on the site requires authentication or login, so every page is entirely public.

We expect many users to find content on the site either via Google search or using the site's built-in search functionality. So even though a page might not be linked internally from within the support site itself, it should still (to us anyway) be considered "reachable". Because all pages are reachable by customers via search, we want to ensure they don't contain broken links. Until recently, we had mistakenly believed we were doing that by passing --follow-sitemap-xml to muffet, thinking that this flag caused muffet to add all pages listed in the sitemap.xml file to its list of pages to check. While this is perhaps entirely on us for not having read the documentation closely enough, I still think we can't be the only ones in this situation. The very fact that search engines such as Google support using a sitemap.xml to discover more pages to crawl seems to imply that many sites don't internally link to every single page.

Perhaps a new flag such as --include-sitemap-xml which causes muffet to add all pages found within the site's sitemap.xml file to the list of pages to check would be reasonable? Please let me know what you think and if I can provide any more detail that would help you understand our use-case. And thanks for making muffet, we love it!
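As a rough illustration of what "add all pages found within the site's sitemap.xml" could involve, the sketch below extracts every `<loc>` entry from a sitemap using Go's standard `encoding/xml` package. This is an assumption about one possible approach, not how muffet implements anything: `urlSet`, `sitemapURLs`, and the example.com URLs are hypothetical, and sitemap index files (a `<sitemapindex>` root pointing at further sitemaps) are not handled.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// sample is a hypothetical sitemap for the three-page site described above.
const sample = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/index.html</loc></url>
  <url><loc>https://example.com/page1.html</loc></url>
  <url><loc>https://example.com/secret.html</loc></url>
</urlset>`

// urlSet mirrors the <urlset> root element of a sitemap.xml file.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// sitemapURLs returns the <loc> value of every <url> entry in the document.
func sitemapURLs(data []byte) ([]string, error) {
	var set urlSet
	if err := xml.Unmarshal(data, &set); err != nil {
		return nil, err
	}

	urls := make([]string, 0, len(set.URLs))
	for _, u := range set.URLs {
		// Trim whitespace in case the generator pretty-prints <loc> contents.
		urls = append(urls, strings.TrimSpace(u.Loc))
	}

	return urls, nil
}

func main() {
	urls, err := sitemapURLs([]byte(sample))
	if err != nil {
		fmt.Println("failed to parse sitemap:", err)
		return
	}

	// Each URL would become an additional starting point for link checking.
	for _, u := range urls {
		fmt.Println(u)
	}
}
```

The resulting list would then be queued alongside the URL given on the command line, so that unlinked pages such as secret.html are checked too.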

raviqqe commented 1 year ago

I think the --include-sitemap-xml flag makes sense. The original feature design of --follow-sitemap-xml is rather messed up.

How many pages do you have in those sitemaps in total?

raviqqe commented 1 year ago

Can you test #317 on the main branch?

nwidger commented 1 year ago

@raviqqe Of course! We'll give it a try and report back.

And the sitemap.xml for our support site currently contains roughly 1,500 pages in total.

nwidger commented 1 year ago

@raviqqe The changes you committed to main work great, thank you so much for such a quick fix! Would it be possible to cut a new release of muffet containing these changes?

zachad commented 1 year ago

Thanks so much @raviqqe for building this new capability, and @nwidger for integrating it so quickly!