mmcdole / gofeed

Parse RSS, Atom and JSON feeds in Go
MIT License
corrupted / mangled nested custom XML #203

Open KyleSanderson opened 1 year ago

KyleSanderson commented 1 year ago

Expected behavior

item.Custom["torrent"] returns the string contents of the XML.

Actual behavior

{"level":"error","module":"feed","feed":"emp","error":"XML syntax error on line 1: element <torrent> closed by </contentLengthHR>","time":"2023-03-14T21:46:39+01:00","message":"could not unmarshal item.Custom.Torrent"}

String returned: Cat.torrentA%2B%10%D5l%F2%BC%B3%F3%5E%D1%14qM%15%F2%26%C4%F0Z\n\t\t\t\t<contentLengthHR>58.34GiB</contentLengthHR>

Steps to reproduce the behavior

        <item>
            <title><![CDATA[[SomeData]]></title>
            <torrent xmlns="http://xmlns.ezrss.it/0.1/">
                <fileName><![CDATA[Cat.torrent]]></fileName>
                <infoHash><![CDATA[A%2B%10%D5l%F2%BC%B3%F3%5E%D1%14qM%15%F2%26%C4%F0Z]]></infoHash>
                <contentLength>62643697701</contentLength>
                <contentLengthHR>58.34 GiB</contentLengthHR>
            </torrent>
        </item>

Note: Please include any links to problem feeds, or the feed content itself!

I don't have access to the site the user reported the issue against, but this is a feed item from it. Allegedly this format is popular among a group of sites.
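For anyone who needs the nested values in the meantime, one possible workaround is to decode the reported item directly with Go's standard encoding/xml package. This is a minimal sketch and not part of gofeed's API: the Torrent and Item types and the parseItem helper are hypothetical names introduced for illustration. The field tags match on local element names only, so the ezrss namespace is not checked.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Torrent mirrors the nested <torrent> element from the report.
// The type and its field names are illustrative, not part of gofeed.
type Torrent struct {
	FileName        string `xml:"fileName"`
	InfoHash        string `xml:"infoHash"`
	ContentLength   int64  `xml:"contentLength"`
	ContentLengthHR string `xml:"contentLengthHR"`
}

// Item holds only the field this sketch needs; other children of
// <item>, such as <title>, are skipped by the decoder.
type Item struct {
	Torrent Torrent `xml:"torrent"`
}

// itemXML is the feed item from the report.
const itemXML = `<item>
    <title><![CDATA[[SomeData]]></title>
    <torrent xmlns="http://xmlns.ezrss.it/0.1/">
        <fileName><![CDATA[Cat.torrent]]></fileName>
        <infoHash><![CDATA[A%2B%10%D5l%F2%BC%B3%F3%5E%D1%14qM%15%F2%26%C4%F0Z]]></infoHash>
        <contentLength>62643697701</contentLength>
        <contentLengthHR>58.34 GiB</contentLengthHR>
    </torrent>
</item>`

// parseItem decodes a single <item> element into an Item.
func parseItem(data string) (Item, error) {
	var it Item
	err := xml.Unmarshal([]byte(data), &it)
	return it, err
}

func main() {
	it, err := parseItem(itemXML)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", it.Torrent)
}
```

This bypasses item.Custom entirely, so it only helps when the raw item XML is available to the caller.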

mmcdole commented 1 year ago

Yeah, I see the issue: when we encounter an unknown tag we call parseText, which attempts to decode the element and fails if the element has nested structure. Our handling of this case is pretty insufficient.

I've created a feature enhancement here with #206 to discuss how to handle this properly.
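One way to keep a nested custom element intact, sketched here with the standard library only, is to capture its raw inner XML verbatim so that child tags and CDATA sections survive for later parsing. This is neither gofeed's parseText nor the design discussed in #206: rawElement and captureRaw are hypothetical names, and the sketch assumes the caller has the document as a string.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// rawElement captures everything nested inside an element, unparsed.
// Hypothetical helper type, not part of gofeed.
type rawElement struct {
	Inner string `xml:",innerxml"`
}

// captureRaw scans doc for the first element whose local name is
// local and returns its raw inner XML, including child tags and
// CDATA markers. It returns an error (io.EOF) if no such element
// is found.
func captureRaw(doc, local string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := dec.Token()
		if err != nil {
			return "", err
		}
		se, ok := tok.(xml.StartElement)
		if !ok || se.Name.Local != local {
			continue
		}
		var raw rawElement
		if err := dec.DecodeElement(&raw, &se); err != nil {
			return "", err
		}
		return raw.Inner, nil
	}
}

func main() {
	doc := `<item><torrent xmlns="http://xmlns.ezrss.it/0.1/">` +
		`<fileName><![CDATA[Cat.torrent]]></fileName>` +
		`<contentLengthHR>58.34 GiB</contentLengthHR>` +
		`</torrent></item>`

	inner, err := captureRaw(doc, "torrent")
	if err != nil {
		fmt.Println("capture error:", err)
		return
	}
	fmt.Println(inner)
}
```

Because the inner XML is returned untouched, a caller could wrap it in a root element again and unmarshal it into a struct of their own.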