Image always nil - Githubissues

dewsy commented 4 years ago

Expected behavior

Item.Image returns string

Actual behavior

Item.Image is nil

Steps to reproduce the behavior

I tried ~2000 articles from 279 different RSS feeds (all of them were XML) and all of the 2000 goffed.Item had nil for image. I looked at the XML feeds, there were image links in them. It must be a parsing errror.

sudhanshuraheja commented 4 years ago

@dewsy could you share links to a few RSS feeds where you were unable to get this to work?

arrow-dev commented 4 years ago

Author

The author doesn't get parsed properly because the RSS spec expects <author>lawyer@boyer.net (Lawyer Boyer)</author>

but the XML contains

<author>
    <name>Howard</name>
    <email>the_actual_email@site.com</email>
</author>

Image Data

I can see an image tag inside items, like so

<image>
    <url>https://media.businessdesk.co.nz/file/c_fill,w_330/Invercargill-Dee-St.jpg</url>
    <title>Tiwai closure: Southlanders look for options</title>
    <link>https://businessdesk.co.nz/article/tiwai-closure-southlanders-look-for-options</link>
</image>

Though, what I find strange is that the RSS spec doesn't have image as a tag inside items

I believe this is why, the default mappings for gofeed.Item uses only the following for Rss /rss/channel/item/itunes:image and /rss/channel/item/media:image and none for Atom.

Possible Solutions

Overall, I'm not sure about raising a PR in GoFeed for parsing elements that are not in the spec. Here's the documentation about using a custom parser

arrow-dev commented 4 years ago

@sudhanshuraheja Thanks for that info, first time dealing with RSS. With the Author it looked like it was lining up with the struct in GoFeed and the same with Image. Looks like this feed is not correctly formatted to the spec and I will need to do a custom parser. - I agree there would be no need for a PR to parse elements not in the spec.

prologic commented 3 years ago

@arrow-dev Did you end up writing a custom parser? I too would like to be able to extract images from various RSS/Atom feeds for https://feeds.twtxt.net/ (sigh no one follows the spec!)

arrow-dev commented 3 years ago

@prologic Actually just ended up using the standard library as shown here

prologic commented 3 years ago

@prologic Actually just ended up using the standard library as shown here

Can you share your code? I don't really feel like reinventing Atom/RSS parsing for what amounts to similar problems you had with gofeed :)

arrow-dev commented 3 years ago

@prologic I'm not able to share the code but basically it is exactly like in that linked example from godoc,

define type(s)/struct(s) reflecting the fields that you want to pull out, in the same shape as the feed/xml with meta tags to map the xml field names and/or attributes
instantiate a variable with the empty/default value of that struct
call xml.Unmarshal(xml feed as bytes, pointer to the var to populate with data) and you should end up with the struct being populated

infogulch commented 1 year ago

It seems images are all over the place. I'm seeing:

extensions.media.content where attrs.medium = 'image' or attrs.type = 'image/*' (gmail blog another example of surrounding the author name with more xml tags, xkcd)
enclosures where type = 'image/*' (androidpolice)
the only image is the first <img> in content. (Ars, fivethirtyeight, marco.org)
the only image is the first <img> in description

Would gofeed devs accept a patch that does a best-effort search in these locations (others?) to fill in the Image field? Any arbitrary search order is ok with me, it would just be nice to have any preview thumbnail for a bare bones rss reader that I'm building without having to break out the parsers and query engines.

rbren commented 7 months ago

+1 to this--nytimes uses media:content medium=image. I'm parsing a bunch of different major news sources, and none get images set.

Would love to get some more best-effort in for getting images.

(PS as someone who maintains an RSS parser in JS, I know it's a slog. Thanks for the hard work here)

rbren commented 7 months ago

Here's a workaround that seems to capture most images:

func getImageFromExtensions(item gofeed.Item) string {
    if media, ok := item.Extensions["media"]; ok {
        if content, ok := media["content"]; ok {
            for _, c := range content {
                if strings.Contains(c.Attrs["type"], "image") || strings.Contains(c.Attrs["medium"], "image") {
                    return c.Attrs["url"]
                }
            }
        }
    }
    return ""
}

Edit: to capture HTML images, like XKCD, here's a more complex pass using "golang.org/x/net/html"

import (
    "strings"

    "golang.org/x/net/html"
    "github.com/mmcdole/gofeed"
)

func getImageFromExtensions(item gofeed.Item) string {
        if media, ok := item.Extensions["media"]; ok {
                if content, ok := media["content"]; ok {
                        for _, c := range content {
                                if strings.Contains(c.Attrs["type"], "image") || strings.Contains(c.Attrs["medium"], "image") {
                                        return c.Attrs["url"]
                                }
                        }
                }
        }
        docImg := findImageInHTML(item.Description)
        if docImg != "" {
                return docImg
        }
        return ""
}

func findImageInHTML(contents string) string {
        doc, err := html.Parse(bytes.NewBufferString(contents))
        if err != nil {
                return ""
        }
        return findImageInHTMLNode(doc)
}

func findImageInHTMLNode(n *html.Node) string {
        if n.Type == html.ElementNode && n.Data == "img" {
                for _, a := range n.Attr {
                        if a.Key == "src" {
                                return a.Val
                        }
                }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
                if img := findImageInHTMLNode(c); img != "" {
                        return img
                }
        }
        return ""
}

mmcdole commented 6 months ago

+1 to this--nytimes uses media:content medium=image. I'm parsing a bunch of different major news sources, and none get images set.

Would love to get some more best-effort in for getting images.

I agree.

Some of the existing "best-effort" code for parsing various fields is in the translator file, specifically in the DefaultRSSTranslator. This file converts the raw RSS object into the unified gofeed format. You can see that we pull some fields out of the iTunesExt object as fallbacks, etc. This approach is convenient because it ensures the RSS feed and item content are fully parsed before we delve into them for various data to populate fallbacks.

[RSS Parser] > [RSS Feed] > [DefaultRSSTranslator] > [Gofeed Obj]

The question is whether we should add this functionality to the parser, rss/parser.go, directly to perform best-effort image parsing, or stick to the above pattern and do this in the DefaultRSSTranslator.

I was leaning towards keeping the RSS / Atom / other parsers focused on parsing the "raw" feed exactly as it is represented, while the gofeed + translator handles some quality of life features like best-effort parsing of various deeply nested fields. It would allow others to make different choices here for precedence, etc.

This approach does slightly lower the value or use-case of someone using the feed-specific RSS parser directly, I suppose. I'm not sure how much that matters as I suspect most people use the gofeed parser?

Any thoughts, @rbren?

mmcdole commented 6 months ago

@infogulch if the offer is still on the table, I'd accept a PR that covers the various use-cases you covered in DefaultRSSTranslator. It looks like @rbren's code above covers some but not all of them that you mentioned.

If not, I can take a look at this -- it just will take a bit longer.

sudhanshuraheja commented 6 months ago

Agree on it being the right strategy

On Wed, 21 Feb 2024 at 01.10 superb-owl @.***> wrote:

I was leaning towards keeping the RSS / Atom / other parsers focused on parsing the "raw" feed exactly as it is represented, while the gofeed + translator handles some quality of life features like best-effort parsing of various deeply nested fields.

This sounds like a great strategy

— Reply to this email directly, view it on GitHub https://github.com/mmcdole/gofeed/issues/133#issuecomment-1954792263, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAG2AET4IUT4B27ZSUC6UTYUTRJDAVCNFSM4JVDFET2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCOJVGQ3TSMRSGYZQ . You are receiving this because you were mentioned.Message ID: @.***>

infogulch commented 6 months ago

Sure I'll take a crack at it. rbren's code is a good starting point.

I guess it's fine to parse with "golang.org/x/net/html" and dig around in the nodes by hand instead of using something like "github.com/antchfx/htmlquery" since we have a simple node search task.

Necoro commented 6 months ago

@infogulch as goquery is already a dependency, you could just use that, can't you?

infogulch commented 6 months ago

Oh it is! That could make it a bit easier.

infogulch commented 6 months ago

I opened a PR (above) to fix this.

Thoughts?

mmcdole commented 6 months ago

Looks great, thank you for the contribution !

mmcdole / gofeed

Image always nil #133

Expected behavior

Actual behavior

Steps to reproduce the behavior

Author

Image Data

Possible Solutions