Closed Andy1Blue closed 1 month ago
Can you describe your use case in more detail? Why do you want to limit checked links to the ones in sitemap.xml?
@raviqqe I just want to say that the --follow-sitemap-xml flag is not working properly. Every time I run it, I receive failed to GET sitemap.xml: 404. Of course, I made sure that the XML file is visible, so the fact that this argument is marked as "deprecated" probably means something: https://github.com/raviqqe/muffet/blob/main/arguments.go
Are you able to help me make it work again?
Can you share the URL to your website?
Unfortunately not, because the sitemap is behind a password (sorry, it is a protected QA environment). But I also tried with e.g. https://nextjs.org/sitemap.xml (the structure of the XML file is the same).
The one for Next.js works on my machine?
> muffet --max-connections 1 --follow-sitemap-xml --verbose --one-page-only https://nextjs.org/sitemap.xml
https://nextjs.org/sitemap.xml
200 https://nextjs.org/
200 https://nextjs.org/blog
200 https://nextjs.org/blog/create-next-app
200 https://nextjs.org/blog/incremental-adoption
200 https://nextjs.org/blog/june-2023-update
200 https://nextjs.org/blog/layouts-rfc
200 https://nextjs.org/blog/new-documentation
200 https://nextjs.org/blog/next-10
200 https://nextjs.org/blog/next-10-1
200 https://nextjs.org/blog/next-10-2
200 https://nextjs.org/blog/next-11
200 https://nextjs.org/blog/next-11-1
200 https://nextjs.org/blog/next-12
200 https://nextjs.org/blog/next-12-1
200 https://nextjs.org/blog/next-12-2
200 https://nextjs.org/blog/next-12-3
200 https://nextjs.org/blog/next-13
200 https://nextjs.org/blog/next-13-1
200 https://nextjs.org/blog/next-13-2
200 https://nextjs.org/blog/next-13-3
200 https://nextjs.org/blog/next-13-4
200 https://nextjs.org/blog/next-13-5
200 https://nextjs.org/blog/next-14
...
You can also try Muffet with my website at https://pen-lang.org. It has a sitemap too. (And it works too :)
@raviqqe you are right, the Next.js site and yours both work perfectly (sorry, I probably passed the wrong args when I tried it before 🤦).
I found the reason for my case: my URL is like https://www.domain.com/en-en/sitemap.xml, so the file sitemap.xml is not available at https://www.domain.com/sitemap.xml. At the root I have https://www.domain.com/sitemap_index.xml instead, so Muffet probably doesn't handle this kind of case. Are you able to confirm it?
Yeah, Muffet uses /sitemap.xml only. How do you expect others to find the custom sitemap paths? robots.txt?
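To illustrate why https://www.domain.com/en-en/sitemap.xml ends up ignored: a minimal sketch of a root-only lookup, assuming the checker always resolves /sitemap.xml against the host of the given URL. This is not Muffet's actual code, and the function name root_sitemap_url is hypothetical.

```python
from urllib.parse import urljoin


def root_sitemap_url(page_url: str) -> str:
    """Return the root-level sitemap URL for the host of page_url.

    Hypothetical helper: joining an absolute path replaces the whole
    path of page_url, so any sub-directory such as /en-en/ is dropped.
    """
    return urljoin(page_url, "/sitemap.xml")


if __name__ == "__main__":
    print(root_sitemap_url("https://www.domain.com/en-en/sitemap.xml"))
```

Under that assumption, a localized sitemap path and the site root both map to the same lookup location, which matches the 404 described above when nothing is served at the root.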
Muffet could also handle sitemap_index.xml (https://developers.google.com/search/docs/crawling-indexing/sitemaps/large-sitemaps). That would solve my "issues".
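For reference, handling a sitemap index mostly means telling a sitemapindex document apart from a urlset one and collecting the loc entries. Here is a rough sketch using only the Python standard library. It is not taken from Muffet; the function name parse_sitemap and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol, in ElementTree's {uri} form.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ("index", child sitemap URLs) or ("urlset", page URLs).

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    if root.tag == NS + "sitemapindex":
        kind, entry = "index", "sitemap"
    elif root.tag == NS + "urlset":
        kind, entry = "urlset", "url"
    else:
        raise ValueError(f"unexpected root element: {root.tag}")
    # Each <sitemap> or <url> entry carries its address in a <loc> child.
    locs = [
        loc.text.strip()
        for loc in root.findall(f"{NS}{entry}/{NS}loc")
        if loc.text
    ]
    return kind, locs


if __name__ == "__main__":
    sample_index = (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://www.domain.com/en-en/sitemap.xml</loc></sitemap>"
        "</sitemapindex>"
    )
    print(parse_sitemap(sample_index))
```

A crawler would then fetch each URL returned for an "index" document and parse it the same way, which still leaves open the question of where the index itself lives.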
The problem is more that there is no standard or de facto standard for where to place those split sitemap files. One way is to let Muffet parse the sitemaps listed in robots.txt, as I mentioned above. The locations listed in https://developers.google.com/search/docs/crawling-indexing/sitemaps/large-sitemaps are just examples. Do you have any idea?
See also https://github.com/h5bp/html5-boilerplate/issues/1895.
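A minimal sketch of the robots.txt idea, assuming sitemap locations are advertised with Sitemap: lines as described by the sitemaps.org protocol. This is only an illustration, not an existing Muffet feature. The function name sitemap_urls_from_robots is hypothetical, and resolving relative values against the site URL is a leniency added here (the protocol expects full URLs).

```python
from urllib.parse import urljoin


def sitemap_urls_from_robots(robots_txt: str, base_url: str) -> list[str]:
    """Collect the URLs from all "Sitemap:" lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            # Assumption: tolerate relative paths by resolving them.
            urls.append(urljoin(base_url, value.strip()))
    return urls


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Disallow: /private\n"
        "Sitemap: https://www.domain.com/sitemap_index.xml\n"
    )
    print(sitemap_urls_from_robots(sample, "https://www.domain.com/"))
```

With something like this, a site could keep its sitemaps at any path and a checker would only need to fetch /robots.txt to discover them.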
Why can't you just allow specifying a specific URL like muffet --max-connections 1 --follow-sitemap-xml --verbose --one-page-only https://nextjs.org/en-en/sitemap.xml (right now it is ignored and Muffet looks for /sitemap.xml on the main page anyway), or also add support for sitemap_index.xml (and crawl all the pages from there)?
Why can't you just allow specifying a specific URL like muffet --max-connections 1 --follow-sitemap-xml --verbose --one-page-only https://nextjs.org/en-en/sitemap.xml
It already works. You don't need to specify the --follow-sitemap-xml option in this case.
muffet --max-connections 1 --verbose --one-page-only https://pen-lang.org/sitemap.xml
See also #316.
also add support for sitemap_index.xml (and crawl all the pages from there)?
This gets back to the question here:
How do you expect Muffet to find your sitemap files in custom locations?
@raviqqe thanks, true, it works perfectly
muffet --max-connections 1 --verbose --one-page-only https://pen-lang.org/sitemap.xml
We can close this issue
It would be great to have the follow-sitemap-xml flag supported again. Are you considering restoring this argument? Or is there any other way to use Muffet for checking sitemap.xml?