taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License

URL patching #7

Closed: nengine closed this issue 10 years ago

nengine commented 10 years ago

Hello, I have the pattern "%E2%80%93" in URL strings and need to replace it with "%96" before a Page is saved. Some websites use strange characters in their URLs, and I discovered that some of those characters must be replaced; otherwise the URL cannot be visited. I believe these URLs are stored as links on a page.

Please let me know if there is a way to replace parts of a URL based on a regex pattern before a page is stored.

taganaka commented 10 years ago

You can use focus_crawl with your own logic to extract and patch all of the links that will be visited:

crawler.focus_crawl do |page|
  page.links.map{ |link| URI.encode(URI.decode(link.to_s.gsub("%E2%80%93","%96"))) }.uniq
end
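For readers on newer Rubies: URI.encode and URI.decode were removed in Ruby 3.0, so the snippet above will not run there as written. Below is a minimal sketch of the patching step alone, done with plain string substitution. REPLACEMENTS, PATTERN and patch_url are hypothetical names made up for illustration and are not part of Polipus. The table assumes the target site expects the Windows-1252 byte for an en dash (%96) instead of its UTF-8 encoding (%E2%80%93).

```ruby
# Hypothetical helper, not part of Polipus.
# Maps percent-encoded sequences (uppercase hex) to their replacements.
REPLACEMENTS = {
  '%E2%80%93' => '%96' # UTF-8 en dash -> Windows-1252 en dash byte
}.freeze

# One regex matching any key of the table, ignoring hex-digit case.
PATTERN = Regexp.union(
  REPLACEMENTS.keys.map { |key| Regexp.new(Regexp.escape(key), Regexp::IGNORECASE) }
)

# Returns the URL as a String with every known sequence replaced.
# Accepts a String or anything responding to to_s (for example a URI).
def patch_url(url)
  url.to_s.gsub(PATTERN) { |match| REPLACEMENTS.fetch(match.upcase) }
end

puts patch_url('http://example.com/a%E2%80%93b')
# prints "http://example.com/a%96b"
```

Inside the focus_crawl block above, something like page.links.map { |link| patch_url(link) }.uniq should then have the same effect, assuming the crawler accepts plain strings there as the original snippet suggests.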

nengine commented 10 years ago

Ok. Thank you very much.

nengine commented 10 years ago

Is there an option for crawl delay?

taganaka commented 10 years ago

You can enable the sleeper plugin: https://github.com/taganaka/polipus/blob/master/lib/polipus/plugins/sleeper.rb

or

polipus.on_page_downloaded { |page| sleep 1 }
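A fixed sleep in the callback always waits the full second, even when the page itself was slow to download. If that matters, one possible alternative is to sleep only for the time still missing since the previous call. The Throttle class below is a hypothetical helper written for illustration; it is not part of Polipus and is only a sketch of the idea.

```ruby
# Hypothetical helper, not part of Polipus: enforces a minimum interval
# (in seconds) between successive calls to #wait.
class Throttle
  def initialize(min_interval)
    @min_interval = min_interval
    @last_call = nil
    @lock = Mutex.new # serializes callers from different threads
  end

  # Sleeps only as long as needed so that calls end up at least
  # min_interval seconds apart. Returns the number of seconds slept.
  def wait
    @lock.synchronize do
      slept = 0.0
      if @last_call
        elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - @last_call
        remaining = @min_interval - elapsed
        if remaining > 0
          sleep(remaining)
          slept = remaining
        end
      end
      @last_call = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      slept
    end
  end
end
```

With throttle = Throttle.new(1.0) created once before the crawl starts, calling throttle.wait inside the callback shown above could take the place of sleep 1.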

nengine commented 10 years ago

Hi thanks!