Closed: spidysenses closed this issue 5 years ago
@spidysenses thank you for your kind words.
Image captions and credits are included in the article body. This is messing up the article content.
This is expected: the goal of this library is to extract the HTML element as marked up by its creators, without doing any extra cleanup or normalization. You can try libraries like newspaper, dragnet, or boilerpipe to extract just the article text.
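If you would rather post-process the extracted HTML yourself, here is a minimal sketch using only the Python standard library. It assumes the captions and credits sit inside `<figure>` or `<figcaption>` elements, which may not hold for every site. `CaptionStripper` and `strip_captions` are hypothetical names for illustration, not part of this library.

```python
from html.parser import HTMLParser


class CaptionStripper(HTMLParser):
    """Collect text from an HTML fragment, skipping <figure>/<figcaption> subtrees."""

    # Assumption: captions and credits are marked up with these tags.
    SKIPPED_TAGS = {"figure", "figcaption"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped subtree
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside any skipped subtree.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_captions(html_fragment):
    """Return the text of html_fragment without figure/caption content."""
    parser = CaptionStripper()
    parser.feed(html_fragment)
    parser.close()
    return parser.text()
```

Calling `strip_captions(article_html)` on the extracted article body returns plain text with the figure content dropped. Sites that mark captions with custom classes instead of `<figure>` would need a different filter.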
This package is great. Thanks for it and other packages from scrapinghub.
Image captions and credits are included in the article body. This is messing up the article content.
Example URL https://www.nytimes.com/2019/03/03/us/tornado-alabama-georgia-deaths.html?action=click&module=Top%20Stories&pgtype=Homepage