Closed: zachad closed this issue 1 year ago
I don't remember exactly, but at the time I thought that adding all of the sitemap's pages up front wasn't useful because they would be reachable from somewhere else anyway (and in your case, they are not...).
Currently, Muffet treats sitemaps just like `robots.txt` files: `sitemap.xml` files are considered allowlists of pages that should be checked, while `robots.txt` files are blocklists.

You might be able to use the `--single-page` option for your use case, but I'm not sure; it also duplicates link checks.

Is the `secret.html` path another entry point to your site, like an auth page that is reachable only through specific actions?
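The allowlist/blocklist distinction described above can be sketched roughly as follows. This is an illustration of the described semantics with hypothetical helper names, not muffet's actual implementation:

```python
# Rough sketch: robots.txt disallow rules act as a blocklist, while
# sitemap.xml entries act as an allowlist that crawled links are
# filtered against. (Not muffet's actual code.)

def should_check(url, sitemap_urls, robots_disallowed):
    """Return True if a discovered link should be checked."""
    if any(url.startswith(prefix) for prefix in robots_disallowed):
        return False  # blocklisted by robots.txt
    if sitemap_urls and url not in sitemap_urls:
        return False  # a sitemap exists, but this page is not allowlisted
    return True

sitemap_urls = {"/index.html", "/page1.html", "/secret.html"}
robots_disallowed = ["/admin"]

print(should_check("/page1.html", sitemap_urls, robots_disallowed))    # True
print(should_check("/admin/panel", sitemap_urls, robots_disallowed))   # False
print(should_check("/unlisted.html", sitemap_urls, robots_disallowed)) # False
```

Note that under these semantics the sitemap only ever narrows the crawl; it never adds pages that no other page links to.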
Thanks @raviqqe for the quick response. I work with @zachad, so maybe I can add some extra context here.

We have a Hugo-generated support website, a public site used by our customers, which contains enough support articles that it would be overwhelming to present users with a giant index of every single page. Because of that, many pages on the site are not linked from anywhere internally. Instead, we pass a `sitemap.xml` file generated by Hugo (which contains every page on the site) to search engines such as Google and tell them to use it to index the site. We do something similar with an Algolia-powered search bar available within the support site itself. All pages on the site are free of any authentication/login requirements, so all pages are entirely public.
We expect many users to find content on the site either via Google search or via the site's built-in search, so even though a page might not be linked internally from within the support site itself, it should still (to us, anyway) be considered "reachable". Because all pages are reachable by customers via search, we want to ensure they don't contain broken links. Until recently, we had mistakenly believed we were doing that by passing `--follow-sitemap` to `muffet`, thinking that this flag caused `muffet` to add all pages listed in the `sitemap.xml` file to its list of pages to check. While this is perhaps entirely on us for not having read the documentation closely enough, I still think we can't be the only ones in this situation. The very fact that search engines such as Google support using a `sitemap.xml` to discover more pages to crawl seems to imply that many sites don't internally link to every single page.
Perhaps a new flag such as `--include-sitemap-xml`, which causes `muffet` to add all pages found within the site's `sitemap.xml` file to the list of pages to check, would be reasonable? Please let me know what you think and if I can provide any more detail that would help you understand our use case. And thanks for making muffet, we love it!
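The proposed flag would essentially seed the crawl queue with every URL listed in the sitemap. A minimal sketch of that extraction step using only the standard library (the flag itself is the proposal above, not an existing muffet feature):

```python
import xml.etree.ElementTree as ET

# A minimal sitemap.xml of the kind generated by tools like Hugo.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/index.html</loc></url>
  <url><loc>https://example.com/page1.html</loc></url>
  <url><loc>https://example.com/secret.html</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_pages(xml_text):
    """Extract every <loc> URL so each one can be checked directly,
    whether or not any other page links to it."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

print(sitemap_pages(SITEMAP))
```

Seeding the queue with these URLs, rather than filtering discovered links against them, is what would make unlinked pages like `secret.html` get checked.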
I think the `--include-sitemap-xml` flag makes sense. The original feature design of `--follow-sitemap-xml` is rather messed up.

How many pages do you have in those sitemaps in total?

Can you test #317 on the main branch?
@raviqqe Of course! We'll give it a try and report back.
And the `sitemap.xml` for our support site currently contains roughly 1,500 pages in total.
@raviqqe The changes you committed to `main` work great, thank you so much for such a quick fix! Would it be possible to cut a new release of `muffet` containing these changes?
Thanks so much @raviqqe for building this new capability, and @nwidger for integrating it so quickly!
Originally added after #32, the `--follow-sitemap` option should read the `sitemap.xml` from a site to determine which pages to check. This doesn't appear to work as expected.

Consider a very simple site with three main pages: `/index.html`, `/page1.html` and `/secret.html`. All three are listed in the `/sitemap.xml`, and `index.html` links to `page1.html`; however, none of the pages link directly to `secret.html`.

When muffet is invoked with `--follow-sitemap`, I expect it to read the sitemap and visit each page that is listed there, including `secret.html`. What it seems to do instead is start with `/index.html` and scan linked pages recursively, only if they're also in the sitemap.

Is this the intended behavior of this option? Is there a way to achieve our use case?
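The reported behavior can be modeled with a toy crawl over the example site's link graph. This is an illustration of the behavior as described, not muffet's code:

```python
# Toy model: crawl recursively from /index.html, following only links
# that are also listed in the sitemap (the reported behavior).
LINKS = {
    "/index.html": ["/page1.html"],
    "/page1.html": [],
    "/secret.html": [],  # listed in the sitemap but linked from nowhere
}
SITEMAP = {"/index.html", "/page1.html", "/secret.html"}

def crawl(start):
    seen, queue = set(), [start]
    while queue:
        page = queue.pop()
        if page in seen:
            continue
        seen.add(page)
        # only follow links that the sitemap allowlists
        queue.extend(link for link in LINKS[page] if link in SITEMAP)
    return seen

print(crawl("/index.html"))  # /secret.html is never visited
```

Because `/secret.html` has no inbound links, it can never enter the queue, no matter what the sitemap says, which matches the behavior observed in the issue.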