postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License

Sitemap XML support #69

Open buren opened 6 years ago

buren commented 6 years ago

Overview

Public API

Usage

Spidr.site(url, sitemap: true)

Common sitemap locations will be tried (/sitemap.xml, etc.).

Spidr.site(url, sitemap: true, robots: true)

This will first try to fetch sitemap locations from /robots.txt; if nothing is found there, the common sitemap locations will be tried.
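The robots.txt side of this could look something like the following — a minimal sketch (an assumption for illustration, not Spidr's actual implementation) of extracting `Sitemap:` entries from a robots.txt body:

```ruby
# Hypothetical helper: collect the URLs from "Sitemap:" lines in a
# robots.txt body. The field name is case-insensitive per the spec.
def sitemaps_from_robots(robots_txt)
  robots_txt.each_line.filter_map do |line|
    line[/\Asitemap:\s*(\S+)/i, 1]
  end
end
```

If this returns an empty array, the spider would fall through to the common locations below.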

Common sitemap locations that will be tried (highest priority first):

sitemap.xml
sitemap.xml.gz
sitemap.gz
sitemap_index.xml
sitemap-index.xml
sitemap_index.xml.gz
sitemap-index.xml.gz
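Resolving that priority list against a site could be sketched like this (names here are illustrative, not Spidr's API):

```ruby
require 'uri'

# Common sitemap file names, highest priority first, as listed above.
COMMON_SITEMAP_FILES = %w[
  sitemap.xml
  sitemap.xml.gz
  sitemap.gz
  sitemap_index.xml
  sitemap-index.xml
  sitemap_index.xml.gz
  sitemap-index.xml.gz
].freeze

# Build the absolute candidate URLs to try for a given site.
def candidate_sitemap_urls(base_url)
  COMMON_SITEMAP_FILES.map { |file| URI.join(base_url, "/#{file}").to_s }
end
```

The spider would request each candidate in order and stop at the first one that returns a parseable sitemap.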

robots.txt support / interface

  1. Implicitly enable robots: if sitemap: is enabled.
  2. Allow mixing robots: with sitemap:. If robots: is not specified, fall back to /sitemap.xml. This would have to be documented.
  3. Add another option to indicate that you wish to infer sitemap from /robots.txt.
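Option 2 above could be sketched roughly as follows (a hypothetical helper, not the actual patch; `robots_sitemaps` stands in for whatever was parsed out of /robots.txt):

```ruby
# Hypothetical sketch of option 2: when robots: is enabled, prefer the
# Sitemap: entries from /robots.txt; otherwise fall back to the default
# /sitemap.xml location.
def sitemap_sources(sitemap:, robots: false, robots_sitemaps: [])
  return [] unless sitemap

  if robots && !robots_sitemaps.empty?
    robots_sitemaps        # URLs taken from /robots.txt
  else
    ['/sitemap.xml']       # documented fallback location
  end
end
```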

https://github.com/postmodern/spidr/issues/19#issuecomment-221189357

The current implementation follows option 2. It would be easy to implement the other variants if that's desirable (example for option 3).

Or a more "fancy" interface:

Spidr.site(url, sitemap: :robots) # check /robots.txt

We could also support non-default locations that aren't listed in /robots.txt. The Sitemap protocol allows sitemaps to be "scoped" under a path, so to support that we could allow:

Spidr.site(url, sitemap: '/catalog/sitemap.xml')
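Taken together, the sitemap: option would then accept three kinds of values. A small dispatch sketch (illustrative only, not the proposed patch):

```ruby
# Hypothetical dispatch on the sitemap: option's value:
#   true    -> try the common sitemap locations
#   :robots -> read Sitemap: entries from /robots.txt first
#   String  -> an explicit, possibly path-scoped, sitemap location
def resolve_sitemap_option(value)
  case value
  when true    then :common_locations
  when :robots then :robots_txt
  when String  then [:explicit, value]   # e.g. '/catalog/sitemap.xml'
  else :none
  end
end
```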

Here is a diff for a commit that adds support for it.

Links

postmodern commented 2 years ago

I like the sitemap: :robots feature. Although maybe it should be a separate option, like robots_sitemap: true?

postmodern commented 2 years ago

Regardless of my suggestions, this is good work and a good feature idea!