Extracting image captions with ArticleBody

scrapinghub / extruct

Extract embedded metadata from HTML markup

BSD 3-Clause "New" or "Revised" License


Extracting image captions with ArticleBody #108

Closed by spidysenses 5 years ago

spidysenses commented 5 years ago

This package is great. Thanks for it and other packages from scrapinghub.

Image captions and credits are included in the article body, which messes up the article content.

Example URL https://www.nytimes.com/2019/03/03/us/tornado-alabama-georgia-deaths.html?action=click&module=Top%20Stories&pgtype=Homepage

lopuhin commented 5 years ago

@spidysenses thank you for your kind words.

Image captions and credits are included in the article body, which messes up the article content.

This is expected: the goal of this library is to extract HTML elements as marked up by their creators, without doing any extra cleanup or normalization. You can try libraries like newspaper, dragnet, or boilerpipe to extract just the article text.
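
If you would rather keep the publisher's markup and only drop the captions, one option is to clean the HTML yourself. The following is a minimal sketch using only the Python standard library, not something extruct provides. It assumes the captions and credits live inside `<figure>` or `<figcaption>` elements, which may not hold for every site; `SKIPPED_TAGS`, `CaptionFreeText`, `text_without_captions`, and the sample markup are all hypothetical names made up for illustration.

```python
from html.parser import HTMLParser

# Assumption: captions and credits are nested inside these elements.
SKIPPED_TAGS = {"figure", "figcaption"}


class CaptionFreeText(HTMLParser):
    """Collect text nodes, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.chunks.append(text)


def text_without_captions(html):
    """Return the text of an HTML fragment with figure content removed."""
    parser = CaptionFreeText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical article markup, not taken from the example URL.
sample = (
    '<div itemprop="articleBody">'
    "<p>First paragraph.</p>"
    "<figure><img src='a.jpg'>"
    "<figcaption>A caption. Credit: Someone</figcaption></figure>"
    "<p>Second paragraph.</p>"
    "</div>"
)

print(text_without_captions(sample))
```

For anything beyond simple cases, the content-extraction libraries mentioned above are likely a better fit, since real pages mark up captions in many different ways.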