scrapy / scrapely

A pure-python HTML screen-scraping library

refactor text extractor and ignore xml declarations #36

Closed kalessin closed 11 years ago

kalessin commented 11 years ago

_process_markup was previously used to remove tags and comments. It is not needed inside text: it is much simpler to use the region's text_content, which already removes tags, comments and, now, XML declarations.
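
For readers unfamiliar with the idea, here is a minimal sketch of text extraction that drops tags, comments and XML declarations. It is not scrapely's implementation: it uses the standard library's html.parser, and the names _TextCollector and strip_markup are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: keeps character data, ignores everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Only text nodes reach this method. Tags, comments and
        # processing instructions such as <?xml ...?> go to other
        # handlers, whose default implementations do nothing.
        self.chunks.append(data)


def strip_markup(markup):
    """Return the text of `markup` with runs of whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join("".join(collector.chunks).split())


if __name__ == "__main__":
    sample = '<?xml version="1.0"?><p>Hello <!-- note --><b>world</b></p>'
    print(strip_markup(sample))
```

The point of the sketch is the same as the change described above: when a single text-extraction step already discards tags, comments and declarations, a separate markup-cleaning pass becomes redundant.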