If we are mapping a whole domain, we could probably read the robots.txt file to find any XML sitemaps, and use all the URLs listed in those maps as seeds.
Some domains have a lot of separate sitemaps (EPA is kind of nuts—101 sitemaps, many of which list only a single URL) and most probably have none, so this is by no means a core or essential feature. It would be pretty useful, though.
For example, FERC's robots.txt file lists a sitemap, which in turn enumerates the URLs we would use as seeds.