qri-io / walk

Webcrawler/sitemapper
GNU General Public License v3.0

Seed domains from sitemaps in robots.txt #30

Open Mr0grog opened 5 years ago

Mr0grog commented 5 years ago

If we are mapping a whole domain, we could probably read the robots.txt file to find any XML sitemaps, and use all the URLs listed in those maps as seeds.

For example, FERC has this robots.txt file:

# robots.txt generated at http://www.mcanerin.com
User-agent: *
Disallow: 
Disallow: /cgi-bin/
Sitemap:http://www.ferc.gov/sitemap.xml

Which leads to this sitemap:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Sitemap File Generated by https://freesitemapgenerator.com/ at Thu, 16 Feb 2017 18:41:09 +0100 -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
                           http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>http://www.ferc.gov/</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.00</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/CalendarFiles/20170208112439-No%20meeting.pdf</loc>
        <lastmod>1970-01-01T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.31</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/EventCalendar/EventsList.aspx?View=listview</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.30</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/contact-us/compliance-help-desk.asp</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.24</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/whats-new/registration/vegetation-mgt-issues-form.asp</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.24</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/resources/glossary.asp</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.24</priority>
    </url>
    <url>
        <loc>http://www.ferc.gov/resources/acronyms.asp</loc>
        <lastmod>2017-02-16T18:41:09+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.23</priority>
    </url>
</urlset>
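
As a rough illustration of the two parsing steps above, here is a minimal Python sketch (not walk's actual code, which this issue does not show). The function names `sitemap_urls_from_robots` and `seed_urls_from_sitemap` are made up for this example, and it assumes the sitemap uses the standard sitemaps.org 0.9 namespace, as the FERC one does. Note that it splits on the first colon only, so a line like `Sitemap:http://www.ferc.gov/sitemap.xml` (no space after the colon) still works.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema, in ElementTree's {uri}tag form.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls_from_robots(robots_txt):
    """Return the URLs named by `Sitemap:` lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop anything after a "#" (robots.txt comments), then split on the
        # first colon only so the "http://" in the value stays intact.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def seed_urls_from_sitemap(xml_text):
    """Return the <loc> value of every <url> entry in a <urlset> sitemap."""
    root = ET.fromstring(xml_text)
    seeds = []
    for loc in root.findall(f"{SITEMAP_NS}url/{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        if url:
            seeds.append(url)
    return seeds
```

Fetching the files is left out on purpose; the crawler would download robots.txt, pass its body to the first function, download each URL it returns, and pass those bodies to the second.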

Some domains have a lot of separate sitemaps (EPA is kind of nuts: 101 sitemaps, many of which list only a single URL), and most domains probably have none, so this is by no means a core or essential feature. It would be pretty useful, though.
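
Since a domain might have anywhere from zero to a hundred-plus sitemaps, the seeding step probably wants to be tolerant: no sitemaps means no extra seeds, and one broken sitemap should not stop the rest. Below is a hedged Python sketch of that idea. `collect_seeds`, the `fetch` callback, and the `max_sitemaps` cap are all hypothetical names for this example, not anything in walk. It also follows `<sitemapindex>` files, which the sitemaps.org protocol allows as a way to point at further sitemaps; whether EPA's sitemaps are organized that way is not something this issue confirms.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema, in ElementTree's {uri}tag form.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_seeds(sitemap_urls, fetch, max_sitemaps=200):
    """Gather de-duplicated page URLs from zero or more sitemaps.

    `fetch` is a caller-supplied function (an assumption of this sketch) that
    takes a URL and returns the response body, raising on any failure.
    """
    seeds = []
    seen_pages = set()
    seen_sitemaps = set()
    queue = list(sitemap_urls)

    # Stop once the queue is empty or the cap on attempted sitemaps is hit.
    while queue and len(seen_sitemaps) < max_sitemaps:
        sitemap_url = queue.pop(0)
        if sitemap_url in seen_sitemaps:
            continue
        seen_sitemaps.add(sitemap_url)

        try:
            root = ET.fromstring(fetch(sitemap_url))
        except Exception:
            # A missing or malformed sitemap is skipped, not fatal.
            continue

        # A <sitemapindex> document points at further sitemaps to read.
        for loc in root.findall(f"{NS}sitemap/{NS}loc"):
            child = (loc.text or "").strip()
            if child:
                queue.append(child)

        # A <urlset> document lists the page URLs themselves.
        for loc in root.findall(f"{NS}url/{NS}loc"):
            url = (loc.text or "").strip()
            if url and url not in seen_pages:
                seen_pages.add(url)
                seeds.append(url)

    return seeds
```

Passing `fetch` in keeps the sketch independent of any particular HTTP client, and the cap is just a guard against sites with very large or self-referencing sitemap indexes.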