A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
## Overview
This proposal adds support for the Sitemap protocol: the agent can discover a site's sitemaps and feed the URLs listed in them into the spider. How the sitemap locations are discovered (from `/robots.txt` or from common locations) is covered under "robots.txt support / interface" below.

One caveat: if the server responds with a `text/html` content type for the sitemap, no URLs will be found. We could be more "liberal" in this situation and allow it.
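A possible "liberal" check, as a sketch only (the `sitemap?` helper is hypothetical; `content_type` and `body` are existing `Page` attributes):

```ruby
# Treat a response as a sitemap if either the Content-Type mentions XML
# or the body itself starts with an XML declaration, so that sitemaps
# mis-served as text/html are still parsed.
def sitemap?(page)
  page.content_type.include?('xml') || page.body.lstrip.start_with?('<?xml')
end
```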
## Public API

A `sitemap: true/false` option to `Agent`.
`Agent#sitemap_urls` and `#initialize_sitemap`.
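For example, a sketch of the accessor in use (the no-argument signature and the return value are assumptions):

```ruby
Spidr.site('https://example.com/', sitemap: true) do |agent|
  # URLs of the sitemap files discovered for the site:
  agent.sitemap_urls.each { |url| puts url }
end
```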
New methods on `Page` (tried to follow the same pattern used in `page/html.rb`):

- `gzip?`
- `each_sitemap_link`
- `each_sitemap_url`
- `sitemap_links`
- `sitemap_urls`
- `each_sitemap_index_link`
- `each_sitemap_index_url`
- `sitemap_index_links`
- `sitemap_index_urls`
- `sitemap_index?`
- `sitemap_urlset?`
- `sitemap_doc`
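A sketch of how these methods might fit together (that `sitemap_index?` matches `<sitemapindex>` documents and `sitemap_urlset?` matches `<urlset>` documents is an assumption based on the names):

```ruby
agent.every_page do |page|
  if page.sitemap_index?
    # A <sitemapindex> lists further sitemap files.
    page.each_sitemap_index_url { |url| puts "nested sitemap: #{url}" }
  elsif page.sitemap_urlset?
    # A <urlset> lists the actual page URLs.
    page.each_sitemap_url { |url| puts "page URL: #{url}" }
  end
end
```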
## Usage
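A minimal sketch, assuming the `sitemap:` option is passed through `Spidr.site` to the agent like the existing options:

```ruby
require 'spidr'

# Spider a site, seeding the queue with URLs found via its sitemap(s).
Spidr.site('https://example.com/', sitemap: true) do |agent|
  agent.every_page do |page|
    puts page.url
  end
end
```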
## robots.txt support / interface

How should sitemap discovery interact with `/robots.txt`? Some variants:

1. `sitemap: true`: common sitemap locations will be tried (`/sitemap.xml`, etc.).
2. `robots: true`: will try to fetch sitemap locations from `/robots.txt`.
3. First try to fetch sitemap locations from `/robots.txt`; if nothing is found there, try the common sitemap locations.

Common sitemap locations that will be tried (highest priority first): `/sitemap.xml`, etc.

The current implementation implements 2. It would be easy to implement the other variants if that's desirable; a sketch for 3. follows below.
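Variant 3 could look roughly like this inside `#initialize_sitemap` (a sketch; both helper methods are hypothetical):

```ruby
# Variant 3: prefer /robots.txt, fall back to the common locations.
def initialize_sitemap
  urls = sitemap_urls_from_robots_txt                 # hypothetical helper
  urls = common_location_sitemap_urls if urls.empty?  # hypothetical helper
  urls.each { |url| enqueue(url) }
end
```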
Or a more "fancy" interface could be offered.
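For instance (the option values here are purely illustrative):

```ruby
Spidr.site(url, sitemap: true)     # try /robots.txt, then common locations
Spidr.site(url, sitemap: :robots)  # only consult /robots.txt
Spidr.site(url, sitemap: :common)  # only try the common locations
```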
Support non-default locations that aren't listed in `/robots.txt`: the sitemap protocol allows Sitemaps to be "scoped" under a path, so we could allow the location to be passed in directly (see the sketch below). Here is a diff for a commit that adds support for it.
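For example (the path is illustrative):

```ruby
# Explicitly point the agent at a sitemap scoped under a path:
Spidr.site(url, sitemap: '/blog/sitemap.xml')
```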
## Links