raviqqe / muffet

Fast website link checker in Go
MIT License

Support for follow-sitemap-xml #398

Closed Andy1Blue closed 1 month ago

Andy1Blue commented 1 month ago

It would be great to have the --follow-sitemap-xml flag supported again. Are you considering restoring this argument? Or is there any way to use Muffet to check sitemap.xml?

raviqqe commented 1 month ago

Can you describe your use case in more detail? Why do you want to limit checked links to the ones in sitemap.xml?

Andy1Blue commented 1 month ago

@raviqqe I just want to say that the --follow-sitemap-xml flag is not working properly. Every time I run it, I receive the error: failed to GET sitemap.xml: 404. Of course, I made sure that the XML file is visible, so the fact that this argument is marked as "deprecated" probably means something: https://github.com/raviqqe/muffet/blob/main/arguments.go

Are you able to help me make it work again?

raviqqe commented 1 month ago

Can you share the URL to your website?

Andy1Blue commented 1 month ago

Unfortunately not, because the sitemap is behind a password (sorry, it is a protected QA environment). But I also tried with e.g. https://nextjs.org/sitemap.xml (the structure of the XML file is the same).

raviqqe commented 1 month ago

The one for Next.js works on my machine?

> muffet --max-connections 1 --follow-sitemap-xml --verbose --one-page-only https://nextjs.org/sitemap.xml
https://nextjs.org/sitemap.xml
        200     https://nextjs.org/
        200     https://nextjs.org/blog
        200     https://nextjs.org/blog/create-next-app
        200     https://nextjs.org/blog/incremental-adoption
        200     https://nextjs.org/blog/june-2023-update
        200     https://nextjs.org/blog/layouts-rfc
        200     https://nextjs.org/blog/new-documentation
        200     https://nextjs.org/blog/next-10
        200     https://nextjs.org/blog/next-10-1
        200     https://nextjs.org/blog/next-10-2
        200     https://nextjs.org/blog/next-11
        200     https://nextjs.org/blog/next-11-1
        200     https://nextjs.org/blog/next-12
        200     https://nextjs.org/blog/next-12-1
        200     https://nextjs.org/blog/next-12-2
        200     https://nextjs.org/blog/next-12-3
        200     https://nextjs.org/blog/next-13
        200     https://nextjs.org/blog/next-13-1
        200     https://nextjs.org/blog/next-13-2
        200     https://nextjs.org/blog/next-13-3
        200     https://nextjs.org/blog/next-13-4
        200     https://nextjs.org/blog/next-13-5
        200     https://nextjs.org/blog/next-14
...

You can also try Muffet with my website at https://pen-lang.org. It has a sitemap too. (And it works too :)

Andy1Blue commented 1 month ago

@raviqqe you are right, the Next.js site and yours both work perfectly (sorry, I probably passed the wrong args when I tried it before 🤦). I found the cause in my case: my URL is like https://www.domain.com/en-en/sitemap.xml, so there is no sitemap.xml at https://www.domain.com/sitemap.xml. At the root I have https://www.domain.com/sitemap_index.xml instead, so Muffet probably doesn't handle this kind of case. Can you confirm that?

raviqqe commented 1 month ago

Yeah, Muffet uses /sitemap.xml only. How do you expect others to find the custom sitemap paths? robots.txt?

Andy1Blue commented 1 month ago

Muffet could also handle sitemap_index.xml (https://developers.google.com/search/docs/crawling-indexing/sitemaps/large-sitemaps). That would solve my "issues".
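
To make the request concrete, here is a minimal Go sketch of reading a sitemap index file, assuming the sitemapindex format from the sitemaps.org protocol. It is not Muffet's code: the function name parseSitemapIndex, the type sitemapIndex, and the www.example.com URLs are hypothetical and only illustrate how the child sitemap locations could be extracted before crawling each one.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// sitemapIndex mirrors the <sitemapindex> root element: a list of
// <sitemap> entries, each with a <loc> child holding a sitemap URL.
type sitemapIndex struct {
	Sitemaps []struct {
		Loc string `xml:"loc"`
	} `xml:"sitemap"`
}

// parseSitemapIndex (hypothetical helper) returns the child sitemap
// URLs listed in a sitemap index document.
func parseSitemapIndex(data string) ([]string, error) {
	var index sitemapIndex
	if err := xml.Unmarshal([]byte(data), &index); err != nil {
		return nil, err
	}
	urls := make([]string, 0, len(index.Sitemaps))
	for _, s := range index.Sitemaps {
		urls = append(urls, strings.TrimSpace(s.Loc))
	}
	return urls, nil
}

// exampleIndex is made-up input for illustration only.
const exampleIndex = `<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/en-en/sitemap.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/de-de/sitemap.xml</loc></sitemap>
</sitemapindex>`

func main() {
	urls, err := parseSitemapIndex(exampleIndex)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// Each URL would then be fetched and checked like a regular sitemap.
	for _, u := range urls {
		fmt.Println(u)
	}
}
```

This only covers parsing. It does not answer where the index file lives, which still has to be passed in or discovered some other way.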

raviqqe commented 1 month ago

The problem is more that there is no standard, or even a de facto standard, for where to place those split sitemap files. One way is to let Muffet parse the sitemaps listed in robots.txt, as I mentioned above. The locations listed in https://developers.google.com/search/docs/crawling-indexing/sitemaps/large-sitemaps are just examples. Do you have any ideas?

See also https://github.com/h5bp/html5-boilerplate/issues/1895.
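
As a rough illustration of the robots.txt idea, the following Go sketch collects the URLs from "Sitemap:" lines in a robots.txt body. This is an assumption about how such discovery could look, not something Muffet implements: the function name sitemapsFromRobots and the sample robots.txt content are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sitemapsFromRobots (hypothetical helper) returns the URLs of all
// "Sitemap:" lines in a robots.txt body. The field name is matched
// case-insensitively, and everything after '#' is treated as a comment,
// so a URL containing a fragment would be cut short.
func sitemapsFromRobots(body string) []string {
	var urls []string
	scanner := bufio.NewScanner(strings.NewReader(body))
	for scanner.Scan() {
		line := scanner.Text()
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		// Split on the first colon only, so the URL's own colons survive.
		key, value, ok := strings.Cut(line, ":")
		if !ok || !strings.EqualFold(strings.TrimSpace(key), "sitemap") {
			continue
		}
		if u := strings.TrimSpace(value); u != "" {
			urls = append(urls, u)
		}
	}
	return urls
}

// exampleRobots is made-up input for illustration only.
const exampleRobots = `User-agent: *
Disallow: /admin/

# Sitemaps in custom locations
Sitemap: https://www.example.com/sitemap_index.xml
sitemap: https://www.example.com/en-en/sitemap.xml
`

func main() {
	for _, u := range sitemapsFromRobots(exampleRobots) {
		fmt.Println(u)
	}
}
```

The appeal of this approach is that the site owner declares the custom paths once, and the checker does not have to guess file names such as sitemap_index.xml.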

Andy1Blue commented 1 month ago

Why can't you just allow specifying a specific URL, like muffet --max-connections 1 --follow-sitemap-xml --verbose --one-page-only https://nextjs.org/en-en/sitemap.xml (right now it is ignored and Muffet looks for /sitemap.xml at the site root anyway), or also add support for sitemap_index.xml (and crawl all the pages from there)?

raviqqe commented 1 month ago

Why can't you just allow specifying a specific URL, like muffet --max-connections 1 --follow-sitemap-xml --verbose --one-page-only https://nextjs.org/en-en/sitemap.xml

It already works. You don't need to specify the --follow-sitemap-xml option in this case.

muffet --max-connections 1 --verbose --one-page-only https://pen-lang.org/sitemap.xml

See also #316.

also add support for sitemap_index.xml (and crawl all the pages from there)?

This gets back to the question here:

How do you expect Muffet to find your sitemap files in custom locations?

Andy1Blue commented 1 month ago

@raviqqe thanks, true, it works perfectly: muffet --max-connections 1 --verbose --one-page-only https://pen-lang.org/sitemap.xml

Andy1Blue commented 1 month ago

We can close this issue