kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io
Other
1.96k stars 342 forks source link

Atom feeds are incorrectly having their text elements entity decoded #67

Closed mmcdole closed 8 years ago

mmcdole commented 8 years ago

Take the following Atom feeds:

Explicit Text Type:

<feed xmlns="http://www.w3.org/2005/Atom">
  <title type="text">Example &lt;b&gt;Atom&lt;/b&gt;</title>
</feed>

Implicit Text Type:

<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example &lt;b&gt;Atom&lt;/b&gt;</title>
</feed>

Feedparser will return the following for Title:

Example <b>Atom</b>

However, the Atom spec states that items with the type="text" attribute should be left as is. Indeed, this is the major difference between type="text" and type="html", where type="html" is explicit about containing escaped markup that should be decoded. If no type attribute exists, the element defaults to type="text".

So, to me, it seems as though feedparser is incorrectly entity-decoding Atom text elements. I think the correct logic would be something like (pseudo code):

if (feedType == "atom") 
    if (elem.Type == "" || elem.Type == "text")   
        return elem.CharData  
    else if (elem.Type == "html")   
       return decodeEntities(elem.CharData)  
    else if (elem.Type == "xhtml")   
       // ... Handle xhtml  

I've checked the output of Simplepie another feed parser, and it indeed returns what I consider the expected output for both the quoted feeds above: Example &lt;b&gt;Atom&lt;/b&gt;

mmcdole commented 8 years ago

Here is another source on the handling of Text elements on Atom feeds:

screen shot 2016-04-24 at 8 45 23 am
kurtmckee commented 8 years ago

Based on what you're describing feedparser is returning the correct text:

  1. Given the original title text: Use the <b> tag for bold
  2. The title must be XML encoded once to ensure the feed XML is valid and well-formed, and is: Use the &lt;b&gt; tag for bold
  3. feedparser returns Use the <b> tag for bold and this is the correct behavior.
  4. SimplePie incorrectly returns Use the &lt;b&gt; tag for bold

Based on what you're reporting, what you're seeing in SimplePie is a content-corruption bug and should be reported (unless SimplePie has established that it will always up-convert text entries to HTML). To get encoded entities back like you're stating SimplePie returns, the original text would already have to include encoded entities:

  1. Given the original title text: Use &lt; to display <
  2. The title must be XML encoded once: Use &amp;lt; to display &lt;
  3. feedparser returns Use &lt; to display < and this is the correct behavior.
  4. SimplePie incorrectly returns Use &amp;lt; to display &lt;

That said, many feeds screw up these simple rules and both feedparser and SimplePie must do their best to handle these mistakes. As a result, some feeds will trigger the code paths that attempt to fix broken/corrupt/invalid/ill-formed encodings/XML/HTML. It's a difficult balance to maintain.

Anyway, what you've reported above is correct behavior.

mmcdole commented 8 years ago

@kurtmckee both sources I've linked to disagree with your position.

The position you have taken, put into words is this:

"Despite the Atom feed explicitly stating that the content in the element is Text, and not entity-encoded html, Feedparser is going to entity-decode it anyway."

The quoted O'reilly book passage from RSS and Atom Feeds that I posted from the screenshot spells out fairly clearly how handle text content in Atom feeds, and how you are explicitly not supposed to entity-decode text elements.

I'm going to quote another source, an article from xml.org entitled Handling Atom Text and Content Constructs:

The problem is that an Atom processor should never second-guess the meaning of the type attribute, and since I implicitly use type="text" a processor will not interpret the contents as markup, as intended for the example.

To me, this is exactly what Feedparser has done, which is second guess the meaning of the type attribute, and what Simplepie has avoided doing. This is why I still believe that Feedparser is out of spec compliance, and Simplepie is not.

An argument could be made that Feedparser has made a design decision to second guess syntactically valid content, because users are still capable of producing syntactically valid feeds but still make mistakes like putting entity-encoded text in a Text element. But that doesn't mean that the result Feedparser produced was correct according to the intention of the Atom spec.

Whether or not that is a valid trade off is left entirely up to you. I just wanted to point this out to you that based on my reading of the Atom spec, the RSS and Atom Feeds book, and the article on handling Atom Text content, it all seemed to be at odds with what Feedparser was producing.

kurtmckee commented 8 years ago

@mmcdole It's necessary to escape the text so that the XML that it is enclosed in will remain valid and well-formed. Based on your reading of the texts it would be impossible to generate a valid XML document that includes a text element that includes an ampersand or a less-than sign. It appears that you are suggesting that no entity-encoding is allowed in a case like this:

<title type="text">Invalid XML < valid XML</title>
<title type="text">Invalid & unusable</title>

So, as I stated before, it is necessary to escape the text to ensure that it does not invalidate the XML it is embedded in. It doesn't matter whether it's XML, JSON, SQL, or even a Python string containing both quotation marks and apostrophes, it's always necessary to escape the content when embedding it in a container format.

There is an additional escaping mechanism: XML CDATA sections. However, this is still merely an XML escaping mechanism and has nothing to do with whether the contents are interpreted as text or HTML. When feedparser returns Example <b>Atom</b>, it is returning the original content that went into the feed pre-XML-escaping.

mmcdole commented 8 years ago

Based on your reading of the texts it would be impossible to generate a valid XML document that includes a text element that includes an ampersand or a less-than sign.

This is a really good point.

I guess I'm just left at a loss for why so many sources seem to go out of there way to talk about how you are explicitly not supposed to interpret text elements as encoded markup. That left me with the impression that you were supposed to take the element content as literal text with no decoding, as if it were in a CDATA section.

But I guess they are/were referring to how you interpret/display the content down the line? As in, if you were rendering the feed content or something?

Anyway, I cede my point. Thanks for hanging in there with me.

kurtmckee commented 8 years ago

To the best of my understanding, it's exactly to ensure that the content is interpreted/displayed correctly. HTML content doesn't require escaping when it's rendered, but text content does.

I mean, it's still extremely important to sanitize the HTML when it's rendered, but it doesn't need to be escaped. =)