Don't parse HTML with RegEx

schlamar commented 11 years ago

That's just wrong. :warning: There are XML/HTML parsers like lxml or Beautiful Soup.

See references:

theanti9 commented 11 years ago

I have seen both of these links before and I'm well aware of the problem; however, I do not have time to rewrite it. The one upside of regex is that it is much more portable. That doesn't necessarily outweigh its downsides or the benefits of DOM parsers, but it does help when trying to stick something together in a very short amount of time just for fun (the original point of this project). If you have the time and are willing, please feel free to redo it with lxml or Beautiful Soup (I recommend the latter; I have used it on other things and it's wonderful) and I will gladly accept the changes. This repo does get a lot of attention. I wish I had more time to devote to it, but life is busy.
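For anyone weighing portability against correctness here, a minimal sketch of the difference, using only the standard library. It swaps in Python's built-in `html.parser` in place of lxml or Beautiful Soup so that nothing has to be installed. The sample markup, the `NAIVE_LINK_RE` pattern, and the `LinkCollector` / `extract_links` names are all made up for illustration and are not taken from PyCrawler.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page (an assumption, not real crawler input).
SAMPLE = """
<a href="/one">one</a>
<a class='x' href='/two'>two</a>
<A HREF="/three">three</A>
<!-- <a href="/commented-out">old</a> -->
"""

# A typical naive pattern (an assumption, not the regex PyCrawler uses).
NAIVE_LINK_RE = re.compile(r'<a href="([^"]*)"')


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this,
        # and comments are routed to handle_comment, so they never land here.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every href found on an <a> tag, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


regex_links = NAIVE_LINK_RE.findall(SAMPLE)
parser_links = extract_links(SAMPLE)

print(regex_links)   # ['/one', '/commented-out']
print(parser_links)  # ['/one', '/two', '/three']
```

The naive pattern misses the link with a different attribute order and quote style, misses the uppercase tag, and picks up a link that is commented out. The parser handles all four cases without an extra dependency. With Beautiful Soup installed, the rough equivalent would be iterating over `soup.find_all("a", href=True)`.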

schlamar commented 11 years ago

No, thanks, already did it :-) http://www.schlamar.org/blog/2010/04/10/python-search-engine-crawler-part-1/

FYI: This took me about 30 minutes of programming, so don't tell me about a short amount of time. Doing it right doesn't have to take more time than a dirty approach.

theanti9 / PyCrawler

Don't parse HTML with RegEx #11