Open janpieper opened 10 years ago
At https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L163 we check `#should_be_visited?`. This allows skipping a URL when the policy has changed during the crawl session but the page was already queued.

`#should_be_visited?` (https://github.com/taganaka/polipus/blob/master/lib/polipus.rb#L351) returns false when the link does not match the pattern. `#page_exists?` already checks for `page.user_data.p_seeded`; maybe we need to check this value in the case above as well.
If you use `#follow_links_like` and the given start URLs do not match the configured regexps, the crawler stops working. Is there a reason why the start URLs aren't whitelisted? The links on the start page match the given regexp.
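A minimal sketch of the proposed fix, using a simplified stand-in for Polipus' internals (the real method signatures differ): if the visit check also honored `page.user_data.p_seeded`, seeded start URLs would be whitelisted even when they don't match any `#follow_links_like` pattern. Here `patterns` and the `OpenStruct`-based page objects are assumptions for illustration.

```ruby
require 'ostruct'

# Hypothetical, simplified version of the visit check.
# Assumption: `patterns` plays the role of the regexps registered via
# #follow_links_like, and `page.user_data.p_seeded` marks a seeded
# start URL (the flag #page_exists? already checks).
def should_be_visited?(page, patterns)
  # Proposed change: always visit seeded start URLs, even when they
  # do not match any configured pattern.
  return true if page.user_data&.p_seeded

  patterns.any? { |re| page.url =~ re }
end

patterns = [%r{^https://example\.com/articles/}]

seed  = OpenStruct.new(url: 'https://example.com/',
                       user_data: OpenStruct.new(p_seeded: true))
match = OpenStruct.new(url: 'https://example.com/articles/1',
                       user_data: OpenStruct.new(p_seeded: false))
other = OpenStruct.new(url: 'https://other.test/',
                       user_data: OpenStruct.new(p_seeded: false))

puts should_be_visited?(seed, patterns)  # seeded start URL passes despite no pattern match
puts should_be_visited?(match, patterns) # regular link passes via pattern match
puts should_be_visited?(other, patterns) # unrelated link is skipped
```

With a check like this, the crawler would no longer stall when the start URL itself falls outside the `#follow_links_like` regexps, while non-seeded links are still filtered as before.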