postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License

Skip processing of pages #49

Closed darkcode85 closed 7 years ago

darkcode85 commented 8 years ago

The documentation says it is possible to skip processing some pages, but I cannot find how to do it. I have tried ignore_links and ignore_pages, but nothing seems to work, e.g.:

    spider = Spidr.site('.....', ignore_links: [%{^/blog/}]) do |spider|
      spider.every_html_page do |page|
        # here I still get pages with the /blog url
      end
    end

How can I ignore some pages based on the URL?

postmodern commented 7 years ago

ignore_links/ignore_links_like matches the full link (the String form of the URL), so your Regexp is being matched against the beginning of the whole URL, not the path. Probably something like spider.ignore_urls_like { |url| url.path.start_with?('/blog/') }. Hope that helps.
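
To see why the anchored pattern never fires, here is a small sketch that uses only Ruby's standard `uri` library, not Spidr itself. The URL `http://example.com/blog/first-post` is a made-up example, and the third pattern is just one possible way to write a Regexp that tolerates the scheme and host.

```ruby
require 'uri'

# Hypothetical URL, used only for illustration.
url = URI.parse('http://example.com/blog/first-post')

# The String form of the URL begins with the scheme ("http://..."),
# so a pattern anchored with ^ cannot find "/blog/" at the start.
link_matches = !!(url.to_s =~ %r{^/blog/})

# The path component on its own does begin with "/blog/".
path_matches = url.path.start_with?('/blog/')

# A pattern that allows for the scheme and host matches the full link.
full_link_matches = !!(url.to_s =~ %r{^https?://[^/]+/blog/})

puts "anchored pattern on full link: #{link_matches}"
puts "start_with? on path:           #{path_matches}"
puts "scheme-aware pattern on link:  #{full_link_matches}"
```

The path check is the same predicate suggested above for ignore_urls_like. The scheme-aware Regexp is an alternative that should work with ignore_links, assuming it is matched against the full link as described in the reply.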