pContent only returns text elements inside of html content

jkeuhlen commented 5 years ago

It looks like EntryContent has an HTMLContent constructor that takes text, so the corresponding function pContent looks like it makes sense at first blush:

pContent :: XML.Element -> Maybe EntryContent
pContent e =
  case pAttr "type" e of
    Nothing -> return (TextContent (elementTexts e))
    Just "text" -> return (TextContent (elementTexts e))
    Just "html" -> return (HTMLContent (elementTexts e))
    Just "xhtml" ->
      case children e of
        [] -> return (TextContent "")
        [c] -> return (XHTMLContent c)
        _ -> Nothing
    Just ty ->
      case pAttr "src" e of
        Nothing -> return (MixedContent (Just ty) (elementNodes e))
        Just uri -> return (ExternalContent (Just ty) uri)

However, since it just casts the elements down to text, it loses a lot of information.

For example, this block of content just returns the empty string:

elContent = [Text (CData {cdVerbatim = CDataText, cdData = " ", cdLine = Just 1}),Elem (Element {elName = QName {qName = "img", qURI = Just "http://www.w3.org/2005/Atom", qPrefix = Nothing}, elAttribs = [Attr {attrKey = QName {qName = "src", qURI = Nothing, qPrefix = Nothing}, attrVal = "https://b.thumbs.redditmedia.com/Pyc5BzCOw0yx2L7unQyfPLe8Z_flrdvKkxrZFtrQsYU.jpg"},Attr {attrKey = QName {qName = "alt", qURI = Nothing, qPrefix = Nothing}, attrVal = "About 10 months ago I became a homeowner and quickly found out I love plants. This is now my happy place :)"},Attr {attrKey = QName {qName = "title", qURI = Nothing, qPrefix = Nothing}, attrVal = "About 10 months ago I became a homeowner and quickly found out I love plants. This is now my happy place :)"}], elContent = [], elLine = Just 1}),Text (CData {cdVerbatim = CDataText, cdData = " ", cdLine = Just 1})]

And the corresponding chunk of xml content in a larger feed:

submitted by /u/zvtvllv
[link] [comments]

I'm happy to submit a pull request to fix this up, but I wanted to check with what the desired functionality is since the current state doesn't seem exactly right. Since HTMLContent wraps text, should it contain the unescaped HTML inside that content tag? Or should it be doing something else?

jkeuhlen commented 5 years ago

Github's too smart and pulled in the content for that raw xml chunk...

<content type="html"><table> <tr><td> <a href="https://www.reddit.com/r/gardening/comments/bqkvqp/about_10_months_ago_i_became_a_homeowner_and/"> <img src="https://b.thumbs.redditmedia.com/Pyc5BzCOw0yx2L7unQyfPLe8Z_flrdvKkxrZFtrQsYU.jpg" alt="About 10 months ago I became a homeowner and quickly found out I love plants. This is now my happy place :)" title="About 10 months ago I became a homeowner and quickly found out I love plants. This is now my happy place :)" /> </a> </td><td> submitted by <a href="https://www.reddit.com/user/zvtvllv"> /u/zvtvllv </a> <br/> <span><a href="https://i.redd.it/mbgnw371y7z21.jpg">[link]</a></span> <span><a href="https://www.reddit.com/r/gardening/comments/bqkvqp/about_10_months_ago_i_became_a_homeowner_and/">[comments]</a></span> </td></tr></table></content>

jkeuhlen commented 5 years ago

@bergmark Any thoughts on this issue?

bergmark commented 5 years ago

Sorry for being unresponsive @jkeuhlen, thanks for the reminder!

I'm not very familiar with this library, I took over maintenance because I needed it as a dependency at some point.

I can't think of what else to put here than the unescaped HTML, so I'd say go for it, would be happy to accept the patch.

jkeuhlen commented 5 years ago

Fair enough!

I'm not entirely sure what belongs in that field either and didn't want to go in the wrong direction with a patch that broke what the expectation was for the general library.

I'll plan on adding the raw HTML (flattened out of the XML elements) in that position and see if it breaks any tests or other expectations.

jkeuhlen commented 5 years ago

Actually, looking at this in more detail, does it make more sense to change the HTMLContent constructor to match XHTMLContent and wrap an Element? Semantically it seems kind of odd, but at this point in processing the HTML has been parsed as XML content already so it should be fully stable. Then we don't have to include some other XML->HTML dependency for reconstructing the raw html and can push that to the library consumer to use their favorite tool. @bergmark

bergmark commented 5 years ago

go for it! sounds good to me

bergmark commented 5 years ago

Fixed in #37

liskin commented 5 years ago

Folks, I'm quite convinced this is very wrong. See https://tools.ietf.org/html/rfc4287#section-4.1.3.3:

 2.  If the value of "type" is "html", the content of atom:content
     MUST NOT contain child elements and SHOULD be suitable for
     handling as HTML [HTML].  The HTML markup MUST be escaped; for
     example, "<br>" as "&lt;br>".  The HTML markup SHOULD be such
     that it could validly appear directly within an HTML <DIV>
     element.  Atom Processors that display the content MAY use the
     markup to aid in displaying it.

The example shown in earlier comments seems to come from reddit, but reddit does correctly escape the content inside of <content type="html">. I'm afraid you have an error somewhere else.

Can we please revert this? Thanks.

jkeuhlen commented 5 years ago

Is this causing you a problem specifically?

I'm open to being wrong, but I don't believe this change would have broken anything based on your description. At the heart, it makes the library more robust to people violating the RSS spec. But if it makes it so it can't parse something that is valid, we can definitely revert it.

liskin commented 5 years ago

Well, it just breaks compilation. I guess I can wrap my text in a TextContent before passing it to HTMLContent. I do wonder, however, whether there is a legitimate usecase for this. As I said, reddit (the one given as an example for this) does not violate the spec...

jkeuhlen commented 5 years ago

Okay so it was a breaking change for your project, sorry about that! After looking at it more in other contexts, the reddit rss actually breaks spec in a few places, which may have been the original issue I ran into here.

I'm still not sure this should be reverted though. Even if items should be "SHOULD be suitable for handling as HTML" that doesn't necessarily mean we should treat them as a raw text field. One of the historic problems with RSS is the lack of people actually conforming to the spec so ideally we could provide a little bit of flexibility for consumers. I think this change should be semantically equivalent to the old version in most cases (minus the constructor differences in actual code) but be able to capture more ill-formed feeds.

Happy to be convinced I'm wrong though!

Thoughts @liskin @bergmark ?

liskin commented 5 years ago

If supporting non-conforming RSS/Atom feeds is desirable, then this needs to stay, indeed. I do think it would be a good thing if there was a big fat warning that when producing (as opposed to consuming) feeds, HTMLContent should contain text with rendered HTML tags, not an actual (X)HTML tag tree.

liskin commented 5 years ago

Actually, looking at this in more detail, does it make more sense to change the HTMLContent constructor to match XHTMLContent and wrap an Element? Semantically it seems kind of odd, but at this point in processing the HTML has been parsed as XML content already so it should be fully stable. Then we don't have to include some other XML->HTML dependency for reconstructing the raw html and can push that to the library consumer to use their favorite tool. @bergmark

Um, now that I'm reading this again (inspired by https://github.com/bergmark/feed/pull/37#issuecomment-546619077) I'm pretty sure the answer here is a resounding “No.” Wrapping an Element (as opposed to wrapping a Node) really does prevent using this correctly, and if I understand it correctly, it even breaks parsing of correct inputs. Indeed, Text.Feed.Import.parseFeedFromFile "/tmp/haskell-weekly.atom" (https://haskellweekly.news/haskell-weekly.atom) in 1.2.0.1 returns entryContent = Just (TextContent "") instead of entryContent = Just (HTMLContent "<p>Welcome to another issue of Haskell Weekly!\n …. :-(

haskell-party / feed

pContent only returns text elements inside of html content #35