Seed domains from sitemaps in robots.txt

If we are mapping a whole domain, we could read its robots.txt file to find any XML sitemaps it declares, and use all the URLs listed in those sitemaps as seeds.

For example, FERC has this robots.txt file:

# robots.txt generated at http://www.mcanerin.com
User-agent: *
Disallow: 
Disallow: /cgi-bin/
Sitemap:http://www.ferc.gov/sitemap.xml
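
A rough sketch of how the Sitemap lines could be pulled out of a robots.txt body. This is in Python for illustration only, not a proposal for how walk should implement it; the function name `sitemap_urls` and the `FERC_ROBOTS` constant are made up for the example. Note that the FERC file has no space after `Sitemap:`, so the split has to happen on the first colon only.

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Return the URLs named by Sitemap lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "field: value" on the first
        # colon only, so the colon in "http://" stays in the value.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# The FERC robots.txt quoted above, used as sample input.
FERC_ROBOTS = """\
# robots.txt generated at http://www.mcanerin.com
User-agent: *
Disallow: 
Disallow: /cgi-bin/
Sitemap:http://www.ferc.gov/sitemap.xml
"""

print(sitemap_urls(FERC_ROBOTS))  # ['http://www.ferc.gov/sitemap.xml']
```

The field name is matched case-insensitively, and a file with no Sitemap lines just yields an empty list, which fits the "most sites have none" case.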

The Sitemap line in that file points to this sitemap:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Sitemap File Generated by https://freesitemapgenerator.com/ at Thu, 16 Feb 2017 18:41:09 +0100 -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
                           http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>http://www.ferc.gov/</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.00</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/CalendarFiles/20170208112439-No%20meeting.pdf</loc>
        <lastmod>1970-01-01T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.31</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/EventCalendar/EventsList.aspx?View=listview</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.30</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/contact-us/compliance-help-desk.asp</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.24</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/whats-new/registration/vegetation-mgt-issues-form.asp</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.24</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/resources/glossary.asp</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.24</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/resources/acronyms.asp</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.23</priority>
    </url>
</urlset>
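
Turning a sitemap like that into seeds mostly means collecting the `<loc>` values. A minimal sketch, again in Python and only as an illustration: `seed_urls` and `SAMPLE_SITEMAP` are hypothetical names, and the sample is a shortened copy of the FERC sitemap above.

```python
import xml.etree.ElementTree as ET

# Namespace declared on <urlset> in the sitemap above.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def seed_urls(sitemap_xml: bytes) -> list[str]:
    """Return every non-empty <loc> value found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text and loc.text.strip()
    ]


# Shortened copy of the FERC sitemap, kept as bytes so the XML
# declaration's encoding is honored by the parser.
SAMPLE_SITEMAP = b"""\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://www.ferc.gov/</loc>
        <priority>1.00</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/resources/glossary.asp</loc>
        <priority>0.24</priority>
    </url>
</urlset>
"""

print(seed_urls(SAMPLE_SITEMAP))
# ['http://www.ferc.gov/', 'http://www.ferc.gov/resources/glossary.asp']
```

One caveat: a sitemap index file (root element `<sitemapindex>`) also uses `<loc>`, but there the values are further sitemaps rather than pages, so a real implementation would probably want to check the root tag and fetch those before seeding.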

Some domains have a lot of separate sitemaps (EPA is kind of nuts: 101 sitemaps, many of which list only a single URL), and most sites probably have none, so this is by no means a core or essential feature. It would be pretty useful, though.

qri-io / walk

Seed domains from sitemaps in robots.txt #30