kjam / wswp

Code for the second edition Web Scraping with Python book by Packt Publications

re.match(link_regex, link) may change to re.search(....) in chp1/link_crawler.py #2

Closed by ProfFL028 6 years ago

ProfFL028 commented 7 years ago

Thanks for the book. When I run the code in the "Link Crawler" section, it downloads only "http://example.webscraping.com". After digging into the code and changing re.match(...) to re.search(...) in the "if" statement, the code works.

kjam commented 7 years ago

Hi @ProfFL028, thanks for reading :) Can you tell me what regex pattern you are using?

ProfFL028 commented 7 years ago

Hi, I mean the code on page 26: link_crawler('http://example.webscraping.com', '/(index|view)/'). It downloads only the main URL and none of the URLs matching '/(index|view)/'. I dug into the code and found that re.match only matches the pattern at the beginning of the string, while re.search matches anywhere in it, so switching to re.search fixes the bug.
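
A minimal sketch of the behavior described in this comment. The sample href '/places/default/index/1' is an assumption about what the crawler sees on the page, not something quoted from the book; the pattern is the one from page 26.

```python
import re

# Pattern passed to link_crawler in the book's example (page 26).
link_regex = '/(index|view)/'

# Assumed sample href; the real links on the site may differ.
link = '/places/default/index/1'

# re.match anchors at position 0, so a link starting with '/places' fails.
match_result = re.match(link_regex, link)

# re.search scans the whole string and finds '/index/' in the middle.
search_result = re.search(link_regex, link)

print(match_result)
print(search_result.group() if search_result else None)
```

With a link shaped like this one, the `if re.match(...)` test in the crawler is never true, which would explain why only the start URL is downloaded.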

kjam commented 6 years ago

I finally had a chance to look at this (sorry for the delay). The best fix is to change the regex to: (/places/default/index|/places/default/view). Hope that helps!
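
A short sketch of why the suggested pattern works with re.match left unchanged. The three sample hrefs are hypothetical, chosen only to show one index link, one view link, and one link that should be skipped.

```python
import re

# Regex suggested in the reply above; it spells out the full path prefix,
# so it matches from the start of the href as re.match requires.
fixed_regex = '(/places/default/index|/places/default/view)'

# Hypothetical hrefs for illustration only.
sample_links = [
    '/places/default/index/1',
    '/places/default/view/Afghanistan-1',
    '/places/default/user/login',
]

# Same test the crawler applies: keep a link only if re.match succeeds.
kept = [link for link in sample_links if re.match(fixed_regex, link)]
print(kept)
```

Compared with switching to re.search, this keeps the match anchored at the start of the href, so a path that merely contains '/index/' or '/view/' somewhere else is not followed.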