Extract feed and item images from more places

mmcdole / gofeed

Parse RSS, Atom and JSON feeds in Go

MIT License


Extract feed and item images from more places #220

Closed: infogulch closed this 7 months ago

infogulch commented 7 months ago

This adds more locations from which image extraction is attempted:

media:content extension https://www.rssboard.org/media-rss#media-content
The first <img> in content or description
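As a rough illustration of the second fallback, here is a minimal sketch of pulling the first `<img>` source out of an HTML fragment. It is not the code from this PR: `firstImgSrc` and `pickImage` are hypothetical names, and it uses `encoding/xml` in non-strict mode only to stay dependency-free, where a real HTML parser would be more robust.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// firstImgSrc returns the src of the first <img> element that has a
// non-empty src attribute, or "" if none is found. Hypothetical helper:
// encoding/xml in non-strict mode stands in for a real HTML parser.
func firstImgSrc(fragment string) string {
	d := xml.NewDecoder(strings.NewReader(fragment))
	d.Strict = false                 // tolerate unquoted attributes and unclosed tags
	d.AutoClose = xml.HTMLAutoClose  // treat <img>, <br>, ... as self-closing
	d.Entity = xml.HTMLEntity        // resolve entities such as &nbsp;
	for {
		tok, err := d.Token()
		if err != nil {
			// End of input or a parse error: either way, stop looking.
			return ""
		}
		start, ok := tok.(xml.StartElement)
		if !ok || !strings.EqualFold(start.Name.Local, "img") {
			continue
		}
		for _, attr := range start.Attr {
			if strings.EqualFold(attr.Name.Local, "src") && attr.Value != "" {
				return attr.Value
			}
		}
	}
}

// pickImage returns the first non-empty candidate, so callers can list
// image sources in priority order. Hypothetical helper.
func pickImage(candidates ...string) string {
	for _, c := range candidates {
		if c != "" {
			return c
		}
	}
	return ""
}

func main() {
	body := `<p>Intro</p><img src="https://example.com/a.png" alt="a"><img src="https://example.com/b.png">`
	// Assumed inputs: no explicit image and no media:content URL,
	// so the body image is the only candidate left.
	fmt.Println(pickImage("", "", firstImgSrc(body)))
}
```

Listing the candidates in priority order keeps explicitly declared images ahead of anything scraped from the body.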

Fixes #133

infogulch commented 7 months ago

Apart from a few tests that have the issue mentioned in the review above, I think this should work fine.

I'd like to get some input on the review above before I convert this from a draft.

mmcdole commented 7 months ago

@infogulch I think the fallback image sources in the translator function you added look clean and make sense to me, including the HTML parsing code. I had no clue that so many feeds stash their images in there, lol.

mmcdole commented 7 months ago

@infogulch update looks good to me.

I might create a separate issue to think about what to do with naked HTML markup within tags.

mmcdole commented 7 months ago

Thank you for your contribution @infogulch !

Now I just need to tackle #210, and hopefully re-enable gating PRs on passing tests.

spacecowboy commented 7 months ago

I'd like to comment that fetching the first <img> inside body isn't such a great idea.

Take for example the feed from slashdot: https://rss.slashdot.org/Slashdot/slashdotMain

The first image in the body will be https://a.fsdn.com/sd/twitter_icon_large.png which is 56x20 pixels. This is clearly unsuitable as a thumbnail for an article.

Perhaps it would be better to place the first body image as an extension? Then clients can choose if they want to consider it or not?
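One way to picture that suggestion is the sketch below, where the body image is stored apart from the explicitly declared image and the client opts in to using it. The `Item` type, its `BodyImage` field and the `Thumbnail` method are all hypothetical. They are not gofeed's types or its extension mechanism.

```go
package main

import "fmt"

// Item is a hypothetical, simplified item; it is not gofeed's Item type.
type Item struct {
	Image     string // image declared explicitly by the feed, if any
	BodyImage string // first <img> found in the body, kept separate
}

// Thumbnail returns the explicit image when present. The body image is
// used only when the caller opts in.
func (it Item) Thumbnail(useBodyImage bool) string {
	if it.Image != "" {
		return it.Image
	}
	if useBodyImage {
		return it.BodyImage
	}
	return ""
}

func main() {
	// The Slashdot case from above: only a small icon in the body.
	it := Item{BodyImage: "https://a.fsdn.com/sd/twitter_icon_large.png"}
	fmt.Printf("%q\n", it.Thumbnail(false)) // client ignores body images
	fmt.Printf("%q\n", it.Thumbnail(true))  // client accepts body images
}
```

With this split, a client that cares about thumbnail quality can ignore body images entirely, while others can still fall back to them.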