scrapy / scrapely

A pure-python HTML screen-scraping library

Move big chunk of HTML parser to cython #86

Closed by plafl 8 years ago

plafl commented 8 years ago

Most of the regexps used for parsing HTML have been replaced with hand-coded Cython. Only attribute parsing (which is executed only when needed) still uses regexps.
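To illustrate the split described above, here is a minimal pure-Python sketch, not the code in this PR: tags are located with a hand-written scan (no regexps), and a regexp runs over a tag's attributes only when they are first requested. The names `scan_tags`, `Tag` and `_ATTR_RE` are hypothetical, and the scan is deliberately simplified (it assumes no `>` inside attribute values and skips comments and doctypes).

```python
import re

# Regexp used only for attributes, and only when they are actually requested.
_ATTR_RE = re.compile(
    r'''([^\s=/>]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+)))?'''
)


class Tag:
    """A tag found by the scanner; its attributes are parsed lazily."""

    def __init__(self, name, raw_attrs, start, end, is_close):
        self.name = name
        self.start = start          # offset of '<' in the source
        self.end = end              # offset just past '>'
        self.is_close = is_close
        self._raw_attrs = raw_attrs
        self._attrs = None          # filled in on first access

    @property
    def attributes(self):
        if self._attrs is None:
            self._attrs = {}
            for m in _ATTR_RE.finditer(self._raw_attrs):
                # Pick whichever value form matched: "..", '..' or unquoted.
                value = next((g for g in m.groups()[1:] if g is not None), None)
                self._attrs[m.group(1).lower()] = value
        return self._attrs


def scan_tags(html):
    """Yield Tag objects using plain string scanning instead of regexps."""
    i = 0
    while True:
        lt = html.find('<', i)
        if lt == -1:
            return
        gt = html.find('>', lt + 1)
        if gt == -1:
            return
        body = html[lt + 1:gt]
        is_close = body.startswith('/')
        if is_close:
            body = body[1:]
        # The tag name is the leading run of alphanumeric characters.
        j = 0
        while j < len(body) and body[j].isalnum():
            j += 1
        if j > 0:  # comments and doctypes start with '!' and are skipped
            yield Tag(body[:j].lower(), body[j:], lt, gt + 1, is_close)
        i = gt + 1
```

In a sketch like this, a caller that only needs tag names and offsets never pays for the attribute regexp; it runs the first time `tag.attributes` is read and the result is cached.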

Benchmarks show that the new code is 3x faster (typical parse time went from 60ms to 30ms per page).
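A per-page figure like the one above could be measured with a small harness along these lines. This is only a sketch under assumptions: `ms_per_page` is a hypothetical helper, `parse` stands for whichever parser entry point is being timed, and `pages` is a list of HTML strings; it is not the benchmark used for this PR.

```python
import timeit


def ms_per_page(parse, pages, repeat=5):
    """Return the best average wall-clock time, in milliseconds, per page."""
    def run():
        for page in pages:
            parse(page)

    # Take the minimum over several runs to reduce noise from other processes.
    best = min(timeit.repeat(run, number=1, repeat=repeat))
    return best * 1000.0 / len(pages)
```

Running it once with the old parser and once with the new one on the same pages, then dividing the two results, gives the speedup factor.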

ruairif commented 8 years ago

Would you mind updating requirements.txt too?

plafl commented 8 years ago

requirements.txt has already been updated, but I'm still working on getting Travis to compile the Cython extension.

plafl commented 8 years ago

At last, it passes all tests.