taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License

RegularExpression To Follow a Link #6

Closed (nengine closed this issue 10 years ago)

nengine commented 10 years ago

When I use a regex like the one below, the crawler does not follow any links.

crawler.follow_links_like(/show.php\?id=[A-Z]/) 

However, if I remove the id parameter, it works.

crawler.follow_links_like(/show.php/) 

Please let me know whether regexes that match dynamic query parameters are supported. Many thanks!

taganaka commented 10 years ago

follow_links_like runs the regex against url.path, which doesn't contain the query string. That is why a pattern that includes \?id= never matches.
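To illustrate the difference between the path and the full URL, here is a minimal sketch that uses only Ruby's standard `uri` library, not polipus itself. The URL is a made-up example, and the sketch assumes the links being matched behave like `URI` objects.

```ruby
require "uri"

# Hypothetical URL, used only for illustration.
url = URI.parse("http://example.com/show.php?id=A")

pattern = /show\.php\?id=[A-Z]/

# The path component stops before the "?", so the query string is not part of it.
path  = url.path   # "/show.php"
query = url.query  # "id=A"

# Matching against the path alone fails because "?id=A" is not there.
path_matches = pattern.match?(url.path)

# Matching against the full URL string succeeds.
full_matches = pattern.match?(url.to_s)

puts "path: #{path.inspect}, query: #{query.inspect}"
puts "matches path? #{path_matches}"
puts "matches full URL? #{full_matches}"
```

A pattern such as `/show\.php/` matches the path, which is consistent with the behaviour reported above when the id parameter is removed.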

You can use focus_crawl instead and apply your own logic to pick out the links on the page that you are interested in:

crawler.focus_crawl do |page|
  # Match against the full URL string so the query string is included.
  page.links.select { |url| url.to_s =~ /show\.php\?id=[A-Z]+$/ }.uniq
end
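The filter inside the block can be tried on its own, without running a crawl. The following is a sketch under assumptions: `select_show_links` is a hypothetical helper (not part of polipus), the sample URLs are invented, and the links are assumed to be `URI` objects.

```ruby
require "uri"

# Hypothetical helper: keeps only links whose full URL ends in
# show.php?id=<uppercase letters>, then drops duplicates.
def select_show_links(links, pattern = /show\.php\?id=[A-Z]+$/)
  links.select { |url| url.to_s =~ pattern }.uniq
end

# Invented sample links standing in for the links found on a page.
links = [
  "http://example.com/show.php?id=ABC",
  "http://example.com/show.php?id=ABC", # duplicate, removed by uniq
  "http://example.com/show.php?id=123", # digits, rejected by [A-Z]+
  "http://example.com/about.php"        # different page, rejected
].map { |s| URI.parse(s) }

selected = select_show_links(links)
puts selected.map(&:to_s)
```

Note that `[A-Z]+$` only accepts ids made of uppercase letters; the character class would need adjusting for numeric or mixed ids.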