lkiesow / python-feedgen

Python module to generate ATOM feeds, RSS feeds and Podcasts.
https://feedgen.kiesow.be/
BSD 2-Clause "Simplified" License
728 stars 122 forks source link

Fix issue where "CDATA" was incorrectly used as content type in atom entries #123

Open rpsh opened 1 year ago

rpsh commented 1 year ago

According to the Atom specification, the acceptable values for the type of <content> are text, html, and xhtml. This PR fixes the previous code that incorrectly used CDATA as the type value.

lkiesow commented 9 months ago

If you carefully read the specification, that actually not quite true:

In the most common case, the type attribute is either text, html, xhtml,

But just that these are the most common cases does not mean that these are the only acceptable values. In fact, the specification goes on describing other values right after that.

That being said, you are right that if you want the content to be enclosed by <![CDATA[...]]>, that doesn't say anything about the media type and that should probably be separated? Just stating that it is HTML is also not correct. I'm open for suggestions.

taesungh commented 8 months ago

While the specification does describe other acceptable values for the type attribute, the <content type="CDATA"> elements created by feedgen fail the W3C Feed Validation Service:

Not a valid MIME type: CDATA (Invalid MIME Type)

I found <content:encoded> is often used for RSS feeds, but I'm not sure what the best approach is for Atom.

michaelnordmeyer commented 7 months ago

Content type html quoted by <!CDATA[...]> is the proper way to do it, IMHO, if the content should be parsed as HTML.

From my understanding the type is a hint for clients to know the content's format, which only applies after parsing the XML.

<!CDATA[...]> is for parsers to ignore any content within.

The different content types are text, html, and xhtml. If the content has characters which break XML parsing, this has to be mitigated.

True xhtml content won't break XML parsing, because it's a SGML subset. This type cannot be used anymore, because most people use HTML5, which is not a subset of SGML.

html content can break XML parsing, because it's not a subset of SGML, and has to be XML-escaped or put within <!CDATA[...]>.

text is the same like html, because any text can have XML-breaking characters.

After the XML parser has parsed the feed, apps can parse the content according to the content type, which is only necessary for html and xhtml, in order to display it to the user.

Alkarex commented 7 months ago

The spec is clear: <content type="CDATA"> needs to be decoded as Base64 https://www.rfc-editor.org/rfc/rfc4287#section-4.1.3.3 For the record, we (at FreshRSS) are receiving reports of bad/wrong feeds, which I am guessing are likely produced by this library https://github.com/FreshRSS/FreshRSS/issues/6180 See also another example of wrong decoding with SimplePie, the library used e.g. by WordPress https://simplepie.org/demo/?feed=https%3A%2F%2Fpeterwunder.de%2Fprojects%2Fachievements%2Fatom.xml

Alkarex commented 6 months ago

@rpsh or anyone with write access to fix the conflicts here?