theanti9 / PyCrawler

A Python web crawler
found only one link per line #7

Closed gtoffoli closed 13 years ago

gtoffoli commented 13 years ago

Hi! I noticed that when multiple links are present on a line, only the last one is matched. I found that linkregex = re.compile('<a\s.*?href=[\'"](.*?)[\'"].*?>') is often OK, but perhaps linkregex = re.compile('<a\s(?:.*?\s)?href=[\'"](.*?)[\'"].*?>') is better. Regards, Giovanni

gtoffoli commented 13 years ago

Oops! I realized that special characters aren't automatically escaped on this HTML page, so the asterisks in my patterns were lost. The first solution above places in the regexp, right after the opening of the A tag (<a), the code for "any whitespace character" (\s), followed by dot, star, question mark (.*?, the syntax matching "as few repetitions as possible" of "any character except newline").
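
For anyone reading along, here is a minimal sketch of the difference described above. The sample HTML line and the greedy pattern are assumptions made up for illustration (the greedy one is only a guess at the kind of pattern that produces the reported symptom, not necessarily what PyCrawler used); the lazy pattern is the first one suggested in this thread.

```python
import re

# Hypothetical sample: one line of HTML containing two links.
line = '<p><a href="/first">one</a> and <a class="x" href="/second">two</a></p>'

# Assumed greedy pattern (illustrative only, not taken from PyCrawler):
# the greedy ".*" runs on to the last "href=" on the line.
greedy = re.compile(r'<a\s.*href=[\'"](.*?)[\'"].*?>')

# Lazy pattern from the comment above: "\s" followed by ".*?",
# which stops at the first "href=" after each opening "<a ".
lazy = re.compile(r'<a\s.*?href=[\'"](.*?)[\'"].*?>')

greedy_links = greedy.findall(line)
lazy_links = lazy.findall(line)

print(greedy_links)  # only the last link on the line
print(lazy_links)    # both links
```

With the greedy quantifier, a single match swallows everything from the first opening tag to the last href on the line, so only that last URL is captured. With the lazy quantifier, each anchor tag is matched separately.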

theanti9 commented 13 years ago

That's pretty interesting that it would do that. I changed it to the regex pattern you suggested for now. Now that I have some free time I think I'm going to sit down and rewrite the whole thing, so we'll see how it turns out!