postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License

Automatically detect and parse sitemap.xml #19

Open postmodern opened 13 years ago

postmodern commented 13 years ago

Automatically detecting and parsing /sitemap.xml might be a good way to cut down on spidering depth.

nofxx commented 8 years ago

Going to do this before invoking spidr, using a lil gem: https://github.com/benbalter/sitemap-parser If it looks good, I'll pull request something like sitemap: true to look similar to robots: true. Sounds good? Thank you for spidr!
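
For reference, a minimal sketch of that pre-crawl approach with the sitemap-parser gem; I'm assuming its #to_a returns the <loc> URLs as an array of strings, and that the require name matches the gem name:

require 'sitemap-parser'   # require name assumed to match the gem name

# Sketch: let the gem fetch and parse the sitemap, then work with the
# plain URL strings before handing anything to spidr.
sitemap = SitemapParser.new('https://example.com/sitemap.xml')
urls    = sitemap.to_a

urls.each { |url| puts url }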

postmodern commented 8 years ago

I'd prefer that /sitemap.xml is requested by the agent and we own the parsing logic. The XML schema seems simple enough; we could probably parse it with a single XPath?
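
A single XPath does cover the basic urlset case. A minimal sketch with Nokogiri, stripping the default sitemap.org namespace so the expression stays short:

require 'net/http'
require 'nokogiri'

# Sketch: fetch /sitemap.xml and pull every <loc> with one XPath.
# remove_namespaces! drops the default namespace so we can write a
# plain '//urlset/url/loc' instead of registering a prefix.
xml = Net::HTTP.get(URI('https://example.com/sitemap.xml'))
doc = Nokogiri::XML(xml)
doc.remove_namespaces!

urls = doc.xpath('//urlset/url/loc').map { |loc| loc.text.strip }
urls.each { |url| puts url }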

nofxx commented 8 years ago

Cool, gonna try it that way. Also, we need robots for that: the filename 'sitemap.xml' isn't a fixed default. I've seen sitemaps under different names, and the name is given by the Sitemap: key in robots.txt.

nofxx commented 8 years ago

Ah, sorry... that's implicit in the subject 'Automatically detect ... sitemap.xml'. Any other way to find it besides robots.txt? Also, maybe the sitemap: option could accept true, false, or a URL like http://url.to/sitemap.xml.
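
Extracting the location from robots.txt is a simple line scan; a sketch (robots.txt may list several case-insensitive Sitemap: lines, each with an absolute URL):

require 'net/http'

# Sketch: collect every "Sitemap: <absolute URL>" line from robots.txt.
robots = Net::HTTP.get(URI('https://example.com/robots.txt'))

sitemap_urls = robots.each_line.filter_map do |line|
  line[/\Asitemap:\s*(\S+)/i, 1]
end

sitemap_urls.each { |url| puts url }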

postmodern commented 8 years ago

Was going to say TIL! I always thought /sitemap.xml was a de facto standard and not configurable.

I see three possible implementations for the sitemap: option:

  1. Implicitly enable robots: if sitemap: is enabled.
  2. Allow mixing robots: with sitemap:. If robots: is not specified, fall back to /sitemap.xml (sketched below). This would have to be documented.
  3. Add another option to indicate that you wish to infer the sitemap location from /robots.txt.
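
A hypothetical sketch of option 2's lookup order; none of these names are spidr's API, and robots_sitemaps stands in for whatever the robots.txt parsing found:

# Hypothetical sketch of option 2 -- not spidr's API.
def resolve_sitemap_urls(base_url, sitemap:, robots_sitemaps: [])
  return []             unless sitemap
  return [URI(sitemap)] if sitemap.is_a?(String)   # explicit location given

  # prefer robots.txt Sitemap: entries, else the conventional default path
  robots_sitemaps.empty? ? [URI.join(base_url, '/sitemap.xml')] : robots_sitemaps
end

resolve_sitemap_urls('https://example.com/', sitemap: true)
# => [#<URI::HTTPS https://example.com/sitemap.xml>]
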
kcalmes commented 6 years ago

If this sitemap feature is already in the source, I did not see it; if it is there, could you point me to it? If not, is it still being developed, and is there an update on progress? If not, is there a workaround? I am currently parsing a sitemap, but I cannot seem to find a way to feed my results into the crawler.

My need has come about because of newer websites that render client-side with JS and don't use traditional anchor tag structures for links, so they are not crawlable. The only way I can get some data is to have the crawler seed in the URLs from the sitemap; otherwise crawling stops on the homepage.
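
Until something like that is built in, here is a workaround sketch that seeds the crawl queue by hand; I'm assuming Spidr::Agent#enqueue accepts a URL string here:

require 'net/http'
require 'nokogiri'
require 'spidr'

# Workaround sketch: parse the sitemap yourself and push its URLs into
# the crawl queue, so pages that are unreachable through anchor tags
# still get visited.
xml  = Net::HTTP.get(URI('https://example.com/sitemap.xml'))
locs = Nokogiri::XML(xml).remove_namespaces!.xpath('//url/loc').map(&:text)

Spidr.site('https://example.com/') do |agent|
  locs.each { |url| agent.enqueue(url) }   # seed before the crawl starts

  agent.every_page do |page|
    puts page.url
  end
end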

buren commented 6 years ago

Sitemaps also have index files that in turn define the locations of other sitemaps. They can be gzipped, and other common locations are (see the sketch after this list):

sitemap_index.xml.gz
sitemap-index.xml.gz
sitemap_index.xml
sitemap-index.xml
sitemap.xml.gz
sitemap.xml
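
A sketch of probing those candidate locations, using only Net::HTTP and Zlib from the standard library and gunzipping the .gz variants:

require 'net/http'
require 'zlib'
require 'stringio'

CANDIDATE_PATHS = %w[
  /sitemap_index.xml.gz /sitemap-index.xml.gz
  /sitemap_index.xml    /sitemap-index.xml
  /sitemap.xml.gz       /sitemap.xml
]

# Sketch: try each common location in turn and return the first sitemap
# body found (decompressed if it was gzipped), or nil if none respond.
def fetch_first_sitemap(base_url)
  CANDIDATE_PATHS.each do |path|
    response = Net::HTTP.get_response(URI.join(base_url, path))
    next unless response.is_a?(Net::HTTPSuccess)

    body = response.body
    body = Zlib::GzipReader.new(StringIO.new(body)).read if path.end_with?('.gz')
    return body
  end

  nil
end

puts fetch_first_sitemap('https://example.com/') ? 'found a sitemap' : 'no sitemap found'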

As for the sitemap: option:

Spidr.site(url, sitemap: :robots) # check /robots.txt
Spidr.site(url, sitemap: true) # check default locations, maybe /robots.txt first?
Spidr.site(url, sitemap: '/some-non-default-location.xml')

Q: Only queue all URLs found in the sitemap, or keep crawling the site as well?
Q: What if robots.txt is 404, errors, or doesn't define a sitemap location?
Q: What if the sitemap location is 404 or errors?

Sitemap protocol: https://www.sitemaps.org/protocol.html
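
Per that protocol, an index document has a <sitemapindex> root whose <sitemap><loc> entries point at further sitemap files, while a plain sitemap has a <urlset> root. A minimal recursive sketch (Nokogiri, namespaces stripped, no gzip or error handling):

require 'net/http'
require 'nokogiri'

# Sketch: walk a sitemap or sitemap index recursively and return every
# page URL it ultimately points at.
def sitemap_page_urls(sitemap_url)
  doc = Nokogiri::XML(Net::HTTP.get(URI(sitemap_url)))
  doc.remove_namespaces!

  if doc.root && doc.root.name == 'sitemapindex'
    doc.xpath('//sitemap/loc').flat_map { |loc| sitemap_page_urls(loc.text.strip) }
  else
    doc.xpath('//url/loc').map { |loc| loc.text.strip }
  end
end

puts sitemap_page_urls('https://example.com/sitemap.xml')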