theanti9 / PyCrawler

A python web crawler
212 stars, 104 forks

Don't parse HTML with RegEx #11

Open: schlamar opened this issue 11 years ago

schlamar commented 11 years ago

That's just wrong. ⚠️ There are XML/HTML parsers like lxml or Beautiful Soup.

See references:

theanti9 commented 11 years ago

I have seen both of these links before and I'm well aware of the problem, but I do not have time to rewrite it. The one upside of regex is that it is much more portable. That doesn't necessarily outweigh its downsides, or the benefits of DOM parsers, but it does help when trying to stick something together in a very short amount of time just for fun (the original point of this project). If you have the time and are willing, please feel free to redo it with lxml or Beautiful Soup and I will gladly accept the changes. I recommend the latter: I have used it on other things and it's wonderful. This repo does get a lot of attention. I wish I had more time to devote to it, but life is busy.
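For readers weighing the portability point: a parser-based approach does not strictly require a third-party dependency. The following is a minimal sketch, assuming Python 3 and using only the standard library's `html.parser`. The names `LinkExtractor`, `extract_links`, `SAMPLE`, and `NAIVE_HREF` are hypothetical and are not taken from PyCrawler, and the regex is an illustrative naive pattern, not the one the project actually uses.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Hypothetical helper: return all <a href> targets found in html."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Illustrative input: one plain link, one commented-out link,
# and one link with uppercase tag name and single-quoted value.
SAMPLE = """
<a href="/about">About</a>
<!-- <a href="/commented-out">old</a> -->
<A HREF='docs/index.html' class="nav">Docs</A>
"""

# A deliberately naive pattern, for comparison only.
NAIVE_HREF = re.compile(r'<a href="([^"]+)"')

if __name__ == "__main__":
    print(extract_links(SAMPLE, "http://example.com/"))
    print(NAIVE_HREF.findall(SAMPLE))
```

On this sample the parser skips the commented-out link and picks up the uppercase, single-quoted one, while the naive pattern does the opposite on both counts. A more careful regex could close some of those gaps, but the parser handles them without extra work.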

schlamar commented 11 years ago

No, thanks, already did it :-) http://www.schlamar.org/blog/2010/04/10/python-search-engine-crawler-part-1/

FYI: This took me about 30 minutes of programming, so don't tell me about a short amount of time. Doing it right doesn't have to take more time than a dirty approach.