sjdirect / abot

Cross Platform C# web crawler framework built for speed and flexibility.
Apache License 2.0

Sitemap.xml parsedlinks is empty #220

Closed JoshTango closed 3 years ago

JoshTango commented 4 years ago

If I feed a sitemap.xml link into Abot, the ParsedLinks collection is null. Now a lot of websites have sitemap.xml files with entries that look like this:

    <sitemap>
      <loc>https://www.somesite.com/sitemap-posts1.xml.gz</loc>
      <lastmod>2020-06-29T04:08:20-04:00</lastmod>
    </sitemap>

The url link seems to be between the <loc> tags.
sjdirect commented 4 years ago

Have you tried creating your own IHyperLinkParser or extending the AngleSharpHyperlinkParser to implement this logic? It wouldn't be hard to do. You would also need to change the following to make sure it would download the content of the sitemap url...

        config.DownloadableContentTypes = "text/html, application/xml";
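The link-extraction half of such a parser can be sketched without any Abot wiring at all, since sitemap files put every URL inside a `<loc>` element. The sketch below is a minimal, standalone helper (the name `SitemapLinkExtractor` is hypothetical, not part of Abot); plugging it into a custom IHyperLinkParser is left out here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: extracts the <loc> URLs from a sitemap or
// sitemap-index document. Not part of Abot; a custom IHyperLinkParser
// implementation would call something like this for XML responses.
public static class SitemapLinkExtractor
{
    public static List<string> ExtractSitemapLinks(string xml)
    {
        var doc = XDocument.Parse(xml);
        // Sitemaps declare the http://www.sitemaps.org/schemas/sitemap/0.9
        // namespace; matching on the local name keeps this tolerant of
        // documents that omit or vary the namespace.
        return doc.Descendants()
                  .Where(e => e.Name.LocalName == "loc")
                  .Select(e => e.Value.Trim())
                  .ToList();
    }
}
```

This works for both `<urlset>` files (page URLs) and `<sitemapindex>` files (nested sitemaps like the sitemap-posts1.xml.gz entry above), because both formats wrap each URL in `<loc>`.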
JoshTango commented 4 years ago

I might one day, but sitemap.xml is such a generalized, standard thing these days that I thought you might want to build it into Abot.

winzig commented 3 years ago

Abot doesn't use sitemaps to help discover pages to crawl?

sjdirect commented 3 years ago

Its default behavior is to crawl the site based on real, navigable links. The sitemap can be completely out of sync with the real site, so it was never part of the original design. However, you can implement your own IHyperLinkParser, as mentioned above, that will use the sitemap.
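One wrinkle a sitemap-aware parser would also have to handle: many sitemaps (like the .xml.gz URL earlier in this thread) are served gzip-compressed, so the raw bytes need decompressing before XML parsing. A minimal sketch using the standard GZipStream, assuming the response bytes have already been fetched (the class name `SitemapGzip` is hypothetical):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper: decompresses a gzipped .xml.gz sitemap payload
// into the XML text a link extractor can then parse. Fetching the bytes
// (and Abot's DownloadableContentTypes config) is out of scope here.
public static class SitemapGzip
{
    public static string DecompressToXml(byte[] gzippedBytes)
    {
        using var input = new MemoryStream(gzippedBytes);
        using var gzip = new GZipStream(input, CompressionMode.Decompress);
        using var reader = new StreamReader(gzip, Encoding.UTF8);
        return reader.ReadToEnd();
    }
}
```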

winzig commented 3 years ago

In my experience, we have used sitemaps extensively to help search engines index pages of our sites that they might otherwise have trouble finding. So yeah, we'll have to implement this internally, I guess.