mmcdole / gofeed

Parse RSS, Atom and JSON feeds in Go
MIT License
2.51k stars 204 forks source link

Atom: use correct xml:base for decoded elements #222

Closed cristoper closed 4 months ago

cristoper commented 4 months ago

The issue is that goxpp's DecodeElement() pop's the BaseStack after unmarshaling an element -- as it should to keep the BaseStack in sync with the current document scope. But that means the atom parser needs to keep a reference to the current base before calling DecodeElement() so that it can correctly resolve URLs.

Without this fix, elements with xml:base attributes will be erroneously resolved with the parent xml:base.

This fix is a little awkward with all of its saving and restoring of BaseStacks. A simpler solution would be to simply get the xml:base attribute from the decoded element and resolve against that. However, that would require changes to goxpp: either change XmlBaseResolveUrl() to accept a string rather than acting as a method, or maybe adding a separate ResolveUrl() function that takes a base and a relative URL that gofeed could use.

mmcdole commented 4 months ago

This fix is a little awkward with all of its saving and restoring of BaseStacks. A simpler solution would be to simply get the xml:base attribute from the decoded element and resolve against that. However, that would require changes to goxpp: either change XmlBaseResolveUrl() to accept a string rather than acting as a method, or maybe adding a separate ResolveUrl() function that takes a base and a relative URL that gofeed could use.

Maybe this makes sense @cristoper ?

You are suggesting that in goxpp we have:

func (p *XMLPullParser) ResolveUrl(baseURL, relativeUrl string) (string, error) {
    // Logic to resolve relativeUrl against baseURL + the current BaseStack URLs?
}

Then in Atom Parser, something like this?:

func (ap *Parser) parseAtomText(p *xpp.XMLPullParser) (string, error) {
    var text struct {
        Type     string `xml:"type,attr"`
        Mode     string `xml:"mode,attr"`
        Base     string `xml:"base,attr"` // Added to capture xml:base attribute
        InnerXML string `xml:",innerxml"`
    }

    // DecodeElement is used to unmarshal the element
    err := p.DecodeElement(&text)
    if err != nil {
        return "", err
    }

    // Use the captured xml:base if present; otherwise, use the current base URL
    baseURL := text.Base
    if baseURL == "" {
        baseURL = p.GetCurrentBaseURL() // Fallback to current base URL from the parser
    }

    // Resolve URLs in InnerXML using the determined base URL
    resolvedInnerXML, err := p.ResolveURL(baseURL, text.InnerXML)
    if err != nil {
        return "", err
    }

    return strings.TrimSpace(resolvedInnerXML), nil
}
cristoper commented 4 months ago

Yes, that's approximately what I had in mind. I just pushed an update to the PR which does similar but copies XmlBaseResolveUrl to gofeed so that we don't have to change goxpp.

However, if you'd prefer not duplicating code from goxpp we could instead make the change there (even by introducing a new function so we don't make any backward-incompatible change to the goxpp API).