mmcdole / gofeed

Parse RSS, Atom and JSON feeds in Go
MIT License
2.51k stars 204 forks source link

Enhance custom element handling #205

Open mmcdole opened 1 year ago

mmcdole commented 1 year ago

When gofeed encounters an element that it doesn't know expect, it will call parseText on it and attempt to store the value in a item.Custom map. parseText calls decode element and this will fail if the the element has nested structure. The handling of this seems pretty insufficient.

If we wanted to revamp custom element handling, where Gofeed encounters a tag in a feed that isn't a part of the spec, does anyone have feedback between two possible ways of representing it?

Option A - Nested Maps

func parseUnknownElement(p *xpp.XMLPullParser) (map[string]interface{}, error) {
    element := make(map[string]interface{})
    elementName := p.Name

    // Store attributes
    for _, attr := range p.Attrs {
        attrName := attr.Name.Local
        attrValue := attr.Value
        element["_attr_"+attrName] = attrValue
    }

    for {
        tok, err := shared.NextTag(p)
        if err != nil {
            return nil, err
        }

        if tok == xpp.EndTag && p.Name == elementName {
            break
        }

        if tok == xpp.StartTag {
            name := p.Name
            innerElement, err := parseUnknownElement(p)
            if err != nil {
                return nil, err
            }
            element[name] = innerElement
        } else {
            text, err := shared.ParseText(p)
            if err != nil {
                return nil, err
            }
            element["_text"] = text
        }
    }

    return element, nil
}

Where, if it encountered this torrent tag, the value would look something like:

"torrent": map[string]interface{}{
    "_attr_xmlns": "http://xmlns.ezrss.it/0.1/",
    "fileName": map[string]interface{}{
        "_text": "Cat.torrent",
    },
    "infoHash": map[string]interface{}{
        "_text": "A%2B%10%D5l%F2%BC%B3%F3%5E%D1%14qM%15%F2%26%C4%F0Z",
    },
    "contentLength": map[string]interface{}{
        "_text": "62643697701",
    },
    "contentLengthHR": map[string]interface{}{
        "_text": "58.34 GiB",
    },
}

Option B - Decode Element to a CustomElement struct

type CustomElement struct {
    XMLName xml.Name
    Data    string `xml:",innerxml"`
    Attrs   []xml.Attr `xml:",any,attr"`
}

func parseUnknownElement(p *xpp.XMLPullParser) (CustomElement, error) {
    element := CustomElement{}

    err := p.DecodeElement(&element)
    if err != nil {
        return element, err
    }

    return element, nil
}

Where the value would be roughly:

"torrent": CustomElement{
    XMLName: xml.Name{
        Space: "http://xmlns.ezrss.it/0.1/",
        Local: "torrent",
    },
    Data: `
        <fileName><![CDATA[Cat.torrent]]></fileName>
        <infoHash><![CDATA[A%2B%10%D5l%F2%BC%B3%F3%5E%D1%14qM%15%F2%26%C4%F0Z]]></infoHash>
        <contentLength>62643697701</contentLength>
        <contentLengthHR>58.34 GiB</contentLengthHR>
    `,
    Attrs: []xml.Attr{
        {
            Name: xml.Name{
                Space: "",
                Local: "xmlns",
            },
            Value: "http://xmlns.ezrss.it/0.1/",
        },
    },
}

@cristoper, @imirzadeh or @rodmcelrath do any of you have opinions on what seems like a better approach?

mmcdole commented 1 year ago

Should I also consider handling custom elements within other elements? Like, within an <author> element for example?

Should each child struct have a customelement or map?

mmcdole commented 1 year ago

We also have our extension capability that parses arbitrary extensions.

Perhaps we should get rid of the CustomElement idea and just tweak the extensions capability so it will parse anything, even non-namespaced extensions

https://github.com/mmcdole/gofeed/blob/master/rss/parser.go#L144

rodmcelrath commented 1 year ago

Yes!  This!Sent from my iPhoneOn Mar 25, 2023, at 5:08 PM, mmcdole @.***> wrote: We also have our extension capability that parses arbitrary extensions. Perhaps we should get rid of the CustomElement idea and just tweak the extensions capability so it will parse anything, even non-namespaced extensions https://github.com/mmcdole/gofeed/blob/master/rss/parser.go#L144

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

rodmcelrath commented 1 year ago

Also possibly things that failed the valid rss test.  Example, had to pull a source from feed recently that was text only, no url.  Invalid rss so it didn’t come through.  Had to do a transformation on feed to make it look like an extension.  Then pass it through!Sent from my iPhoneOn Mar 25, 2023, at 8:29 PM, Rod McElrath @.> wrote:Yes!  This!Sent from my iPhoneOn Mar 25, 2023, at 5:08 PM, mmcdole @.> wrote: We also have our extension capability that parses arbitrary extensions. Perhaps we should get rid of the CustomElement idea and just tweak the extensions capability so it will parse anything, even non-namespaced extensions https://github.com/mmcdole/gofeed/blob/master/rss/parser.go#L144

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

rodmcelrath commented 1 year ago

I get it. Not strictly valid RSS.  Source with no url which is required.  If you let it through some people would be surprised.But pubdate is supposed to be a standard format. Literally no one observes the standard. But the code bends over backward to get something valid.RSS standard is practically a joke. Some of the things I have seen!Sent from my iPhoneOn Mar 25, 2023, at 8:29 PM, Rod McElrath @.> wrote:Yes!  This!Sent from my iPhoneOn Mar 25, 2023, at 5:08 PM, mmcdole @.> wrote: We also have our extension capability that parses arbitrary extensions. Perhaps we should get rid of the CustomElement idea and just tweak the extensions capability so it will parse anything, even non-namespaced extensions https://github.com/mmcdole/gofeed/blob/master/rss/parser.go#L144

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>