postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License

Sitemap XML support #69

Open buren opened 6 years ago

buren commented 6 years ago

Overview

Public API

Usage

Spidr.site(url, sitemap: true)

Common sitemap locations will be tried (/sitemap.xml, etc.).

Spidr.site(url, sitemap: true, robots: true)

This will first try to fetch sitemap locations from /robots.txt; if nothing is found there, the common sitemap locations will be tried.
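The robots.txt side of this could look something like the following — a minimal sketch (an assumption for illustration, not Spidr's actual implementation) of extracting `Sitemap:` entries from a robots.txt body:

```ruby
# Hypothetical helper: collect the URLs from "Sitemap:" lines in a
# robots.txt body. The field name is case-insensitive per the spec.
def sitemaps_from_robots(robots_txt)
  robots_txt.each_line.filter_map do |line|
    line[/\Asitemap:\s*(\S+)/i, 1]
  end
end
```

If this returns an empty array, the spider would fall through to the common locations below.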

Common sitemap locations that will be tried (highest priority first):

sitemap.xml
sitemap.xml.gz
sitemap.gz
sitemap_index.xml
sitemap-index.xml
sitemap_index.xml.gz
sitemap-index.xml.gz
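Resolving that priority list against a site could be sketched like this (names here are illustrative, not Spidr's API):

```ruby
require 'uri'

# Common sitemap file names, highest priority first, as listed above.
COMMON_SITEMAP_FILES = %w[
  sitemap.xml
  sitemap.xml.gz
  sitemap.gz
  sitemap_index.xml
  sitemap-index.xml
  sitemap_index.xml.gz
  sitemap-index.xml.gz
].freeze

# Build the absolute candidate URLs to try for a given site.
def candidate_sitemap_urls(base_url)
  COMMON_SITEMAP_FILES.map { |file| URI.join(base_url, "/#{file}").to_s }
end
```

The spider would request each candidate in order and stop at the first one that returns a parseable sitemap.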

robots.txt support / interface

  1. Implicitly enable robots: if sitemap: is enabled.
  2. Allow mixing robots: with sitemap:. If robots: is not specified, fall back to /sitemap.xml. This would have to be documented.
  3. Add another option to indicate that you wish to infer sitemap from /robots.txt.
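Option 2 above could be sketched roughly as follows (a hypothetical helper, not the actual patch; `robots_sitemaps` stands in for whatever was parsed out of /robots.txt):

```ruby
# Hypothetical sketch of option 2: when robots: is enabled, prefer the
# Sitemap: entries from /robots.txt; otherwise fall back to the default
# /sitemap.xml location.
def sitemap_sources(sitemap:, robots: false, robots_sitemaps: [])
  return [] unless sitemap

  if robots && !robots_sitemaps.empty?
    robots_sitemaps        # URLs taken from /robots.txt
  else
    ['/sitemap.xml']       # documented fallback location
  end
end
```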

https://github.com/postmodern/spidr/issues/19#issuecomment-221189357

The current implementation follows option 2. It would be easy to implement the other variants if that's desirable (example for option 3).

Or a more "fancy" interface:

Spidr.site(url, sitemap: :robots) # check /robots.txt

We could also support non-default locations that aren't listed in /robots.txt. The Sitemap protocol allows sitemaps to be "scoped" under a path, so to support that we could allow:

Spidr.site(url, sitemap: '/catalog/sitemap.xml')
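Taken together, the sitemap: option would then accept three kinds of values. A small dispatch sketch (illustrative only, not the proposed patch):

```ruby
# Hypothetical dispatch on the sitemap: option's value:
#   true    -> try the common sitemap locations
#   :robots -> read Sitemap: entries from /robots.txt first
#   String  -> an explicit, possibly path-scoped, sitemap location
def resolve_sitemap_option(value)
  case value
  when true    then :common_locations
  when :robots then :robots_txt
  when String  then [:explicit, value]   # e.g. '/catalog/sitemap.xml'
  else :none
  end
end
```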

Here is a diff for a commit that adds support for it.

Links

postmodern commented 2 years ago

I like the sitemap: :robots feature. Although maybe it should be a separate option, like robots_sitemap: true?

postmodern commented 2 years ago

Regardless of my suggestions, this is good work and a good feature idea!