taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License

Whitelist start urls? #43

Open · janpieper opened this issue 10 years ago

janpieper commented 10 years ago

If you use #follow_links_like and the given start URLs do not match the configured regexps, the crawler stops working. Is there a reason why the start URLs aren't whitelisted?

start_urls = [ "http://www.example.com/foo/bar" ]
Polipus.crawler("dummy", start_urls, options) do |crawler|
  # the start URL ("/foo/bar") does not match this pattern, so the
  # crawler discards it and never fetches the start page
  crawler.follow_links_like(/\/bar\/foo/)
end

The links on the start page do match the given regexp; it is only the start URL itself that fails the filter.

tmaier commented 10 years ago

At https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L163 we check #should_be_visited?. This allows skipping a URL when the policy has changed during the crawl session but the page was already queued.
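
For context, the gate at that point amounts to something like the following. This is a paraphrased, hypothetical sketch (the method names around should_be_visited? are illustrative), not the library's exact code:

# paraphrased sketch of the dequeue-time gate, not the actual polipus source
def process_queued(page)
  # queued pages are re-checked against the *current* policy, so a start
  # URL that fails follow_links_like is dropped here before being fetched
  return unless should_be_visited?(page.url)
  fetch_and_process(page) # hypothetical helper standing in for the real work
end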

#should_be_visited? (https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L351) returns false when the link does not match the pattern.
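
A minimal sketch of that check, assuming only the follow patterns matter (the real method applies more rules, but this is the part relevant here):

# hypothetical simplification of #should_be_visited?
def should_be_visited?(url)
  # false unless the URL matches at least one follow_links_like pattern;
  # note that seeded start URLs get no special treatment here
  @follow_links_like.any? { |pattern| url.to_s =~ pattern }
end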

#page_exists? already checks for page.user_data.p_seeded. Maybe we also need to check for this value in the case above.
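
A hedged sketch of what that could look like (hypothetical code; p_seeded is the flag on seeded start URLs that #page_exists? already consults, per the sentence above):

# hypothetical sketch of the proposed fix: let seeded start URLs pass
# the dequeue check even when they fail the follow pattern
def visitable?(page)
  seeded = page.user_data && page.user_data.p_seeded
  seeded || should_be_visited?(page.url)
end

With a check like this, the start URL from the report above would be fetched, and the matching links on that page would then be enqueued as usual.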