grangier / python-goose

Html Content / Article Extractor, web scraping lib in Python
Apache License 2.0
3.97k stars 786 forks

Android central page not parsed #111

Open kambanthemaker opened 10 years ago

kambanthemaker commented 10 years ago

Extraction fails on this page: http://www.androidcentral.com/lg-g3-available-pre-order-uk-499-shipping-july-1st

It would be nice to get your input on this.

kambanthemaker commented 10 years ago

The clean() method in cleaners.py calls remove_scripts_styles, which seems to remove all content because of a formatting issue in the original document.

As a temporary fix, I am using a regex to remove all script tags before generating the document.
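A minimal sketch of that kind of workaround, for anyone hitting the same problem. The helper name `strip_script_tags` and the sample HTML are made up for illustration and are not part of python-goose; the exact regex used in the original workaround was not posted, so this pattern is only an assumption about what would do the job.

```python
import re

# Hypothetical helper, not part of python-goose.
# Matches an opening <script ...> tag, its body, and the closing tag.
# DOTALL lets the body span several lines; IGNORECASE covers <SCRIPT>.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_script_tags(raw_html):
    """Return raw_html with every <script>...</script> element removed."""
    return SCRIPT_RE.sub("", raw_html)


# Made-up sample page, only to show the effect.
sample = (
    "<html><head><script type='text/javascript'>var a = '<p>';</script></head>"
    "<body><p>Article text.</p><SCRIPT>\nalert(1)\n</SCRIPT></body></html>"
)
cleaned = strip_script_tags(sample)
print(cleaned)
```

The cleaned string would then be handed to the extractor in place of the fetched page, for example through the `raw_html` argument of `Goose().extract(...)`, assuming the installed version accepts it. A regex is a blunt tool for HTML and can misfire on unusual markup, so this is a stopgap and not a fix for the cleaner itself.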