Closed jkeuhlen closed 5 years ago
Github's too smart and pulled in the content for that raw xml chunk...
<content type="html"><table> <tr><td> <a href="https://www.reddit.com/r/gardening/comments/bqkvqp/about_10_months_ago_i_became_a_homeowner_and/"> <img src="https://b.thumbs.redditmedia.com/Pyc5BzCOw0yx2L7unQyfPLe8Z_flrdvKkxrZFtrQsYU.jpg" alt="About 10 months ago I became a homeowner and quickly found out I love plants. This is now my happy place :)" title="About 10 months ago I became a homeowner and quickly found out I love plants. This is now my happy place :)" /> </a> </td><td> submitted by <a href="https://www.reddit.com/user/zvtvllv"> /u/zvtvllv </a> <br/> <span><a href="https://i.redd.it/mbgnw371y7z21.jpg">[link]</a></span> <span><a href="https://www.reddit.com/r/gardening/comments/bqkvqp/about_10_months_ago_i_became_a_homeowner_and/">[comments]</a></span> </td></tr></table></content>
@bergmark Any thoughts on this issue?
Sorry for being unresponsive @jkeuhlen, thanks for the reminder!
I'm not very familiar with this library, I took over maintenance because I needed it as a dependency at some point.
I can't think of what else to put here than the unescaped HTML, so I'd say go for it, would be happy to accept the patch.
Fair enough!
I'm not entirely sure what belongs in that field either and didn't want to go in the wrong direction with a patch that broke what the expectation was for the general library.
I'll plan on adding the raw HTML (flattened out of the XML elements) in that position and see if it breaks any tests or other expectations.
Actually, looking at this in more detail, does it make more sense to change the HTMLContent constructor to match XHTMLContent and wrap an Element? Semantically it seems kind of odd, but at this point in processing the HTML has been parsed as XML content already so it should be fully stable. Then we don't have to include some other XML->HTML dependency for reconstructing the raw html and can push that to the library consumer to use their favorite tool. @bergmark
go for it! sounds good to me
Fixed in #37
Folks, I'm quite convinced this is very wrong. See https://tools.ietf.org/html/rfc4287#section-4.1.3.3:
2. If the value of "type" is "html", the content of atom:content
MUST NOT contain child elements and SHOULD be suitable for
handling as HTML [HTML]. The HTML markup MUST be escaped; for
example, "<br>" as "&lt;br>". The HTML markup SHOULD be such
that it could validly appear directly within an HTML <DIV>
element. Atom Processors that display the content MAY use the
markup to aid in displaying it.
The example shown in earlier comments seems to come from reddit, but reddit does correctly escape the content inside of <content type="html">. I'm afraid you have an error somewhere else.
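To make the escaping rule quoted above concrete, here is a minimal sketch of what a feed producer would do before placing HTML inside <content type="html">. The names escapeHtml and htmlContentElement are hypothetical helpers written for this illustration; they are not part of the feed library. The sketch escapes ">" and '"' as well, which goes slightly beyond the RFC's own "&lt;br>" example but is still valid XML.

```haskell
-- Hypothetical helpers for illustration only; not part of the feed library.

-- | Escape the characters that are significant in XML, so that HTML markup
-- travels as plain text inside an atom:content element of type "html".
escapeHtml :: String -> String
escapeHtml = concatMap escapeChar
  where
    escapeChar '&' = "&amp;"
    escapeChar '<' = "&lt;"
    escapeChar '>' = "&gt;"
    escapeChar '"' = "&quot;"
    escapeChar c   = [c]

-- | Wrap already-rendered HTML in a <content type="html"> element.
-- After escaping, the element has no child elements, only text.
htmlContentElement :: String -> String
htmlContentElement html =
  "<content type=\"html\">" ++ escapeHtml html ++ "</content>"
```

Loaded in GHCi, escapeHtml "<br>" evaluates to "&lt;br&gt;", so an XML parser reading the resulting element sees a single text node rather than a br child element.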
Can we please revert this? Thanks.
Is this causing you a problem specifically?
I'm open to being wrong, but I don't believe this change would have broken anything based on your description. At the heart, it makes the library more robust to people violating the RSS spec. But if it makes it so it can't parse something that is valid, we can definitely revert it.
Well, it just breaks compilation. I guess I can wrap my text in a TextContent before passing it to HTMLContent. I do wonder, however, whether there is a legitimate use case for this. As I said, reddit (the one given as an example for this) does not violate the spec...
Okay so it was a breaking change for your project, sorry about that! After looking at it more in other contexts, the reddit rss actually breaks spec in a few places, which may have been the original issue I ran into here.
I'm still not sure this should be reverted though. Even if items "SHOULD be suitable for handling as HTML", that doesn't necessarily mean we should treat them as a raw text field. One of the historic problems with RSS is the lack of people actually conforming to the spec, so ideally we could provide a little bit of flexibility for consumers. I think this change should be semantically equivalent to the old version in most cases (minus the constructor differences in actual code) but be able to capture more ill-formed feeds.
Happy to be convinced I'm wrong though!
Thoughts @liskin @bergmark ?
If supporting non-conforming RSS/Atom feeds is desirable, then this needs to stay, indeed.
I do think it would be a good thing if there was a big fat warning that when producing (as opposed to consuming) feeds, HTMLContent should contain text with rendered HTML tags, not an actual (X)HTML tag tree.
Actually, looking at this in more detail, does it make more sense to change the HTMLContent constructor to match XHTMLContent and wrap an Element? Semantically it seems kind of odd, but at this point in processing the HTML has been parsed as XML content already so it should be fully stable. Then we don't have to include some other XML->HTML dependency for reconstructing the raw html and can push that to the library consumer to use their favorite tool. @bergmark
Um, now that I'm reading this again (inspired by https://github.com/bergmark/feed/pull/37#issuecomment-546619077) I'm pretty sure the answer here is a resounding “No.” Wrapping an Element (as opposed to wrapping a Node) really does prevent using this correctly, and if I understand it correctly, it even breaks parsing of correct inputs. Indeed, Text.Feed.Import.parseFeedFromFile "/tmp/haskell-weekly.atom" (https://haskellweekly.news/haskell-weekly.atom) in 1.2.0.1 returns entryContent = Just (TextContent "") instead of entryContent = Just (HTMLContent "<p>Welcome to another issue of Haskell Weekly!\n …. :-(
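The Element-versus-Node point can be shown with a toy model. Everything below is a hypothetical stand-in: the Node, Element and Content types are simplified for this sketch and are not the definitions used by feed or xml-types, and parseWrappingElement is only one plausible way an empty TextContent could arise, not the library's actual code. The key observation is that a spec-conforming feed delivers the escaped markup as a text node, so a constructor that demands an element has nothing to hold.

```haskell
-- Toy XML model for illustration; NOT the xml-types definitions.
data Node
  = NodeElement Element
  | NodeText String
  deriving (Eq, Show)

data Element = Element
  { elementName     :: String
  , elementChildren :: [Node]
  } deriving (Eq, Show)

-- Hypothetical content type modelled on the discussion.
data Content
  = TextContent String
  | HtmlAsElement Element  -- HTML content wrapping a single element
  | HtmlAsNodes [Node]     -- alternative: HTML content wrapping nodes
  deriving (Eq, Show)

-- | Element-wrapping variant: it can only keep a child element.
-- With no child element it falls back to empty text (assumed behaviour).
parseWrappingElement :: Element -> Content
parseWrappingElement e =
  case [c | NodeElement c <- elementChildren e] of
    (c : _) -> HtmlAsElement c
    []      -> TextContent ""

-- | Node-wrapping variant: keeps whatever children are present,
-- whether they are text nodes or elements.
parseWrappingNodes :: Element -> Content
parseWrappingNodes = HtmlAsNodes . elementChildren

-- | A conforming entry after XML parsing: the escaped markup has been
-- decoded into one text node, and there are no child elements.
conformingContent :: Element
conformingContent = Element "content" [NodeText "<p>Welcome</p>"]
```

In this model, parseWrappingElement conformingContent yields TextContent "", while parseWrappingNodes conformingContent yields HtmlAsNodes [NodeText "<p>Welcome</p>"] and so keeps the markup.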
It looks like EntryContent has an HTMLContent constructor that takes text, so the corresponding function pContent looks like it makes sense at first blush:
However, since it just casts the elements down to text, it loses a lot of information.
For example, this block of content just returns the empty string:
And the corresponding chunk of xml content in a larger feed:
[link] [comments]
I'm happy to submit a pull request to fix this up, but I wanted to check with what the desired functionality is since the current state doesn't seem exactly right. Since HTMLContent wraps text, should it contain the unescaped HTML inside that content tag? Or should it be doing something else?
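As a rough illustration of the behaviour described above, the sketch below contrasts two ways of turning a parsed content element into text. It uses a toy XML model, not the xml-types definitions, and directText and innerHtml are hypothetical names rather than functions from feed. Rendering is deliberately simplified: it ignores void elements such as img and always emits an explicit closing tag.

```haskell
-- Toy XML model for illustration; NOT the xml-types definitions.
data Node
  = NodeElement Element
  | NodeText String

data Element = Element
  { elementName     :: String
  , elementAttrs    :: [(String, String)]
  , elementChildren :: [Node]
  }

-- | Keep only the direct text children. Markup that arrives as child
-- elements is dropped, which is how unescaped HTML ends up as "".
directText :: Element -> String
directText e = concat [t | NodeText t <- elementChildren e]

-- | Escape characters that are significant in XML text and attributes.
escapeXml :: String -> String
escapeXml = concatMap esc
  where
    esc '&' = "&amp;"
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '"' = "&quot;"
    esc c   = [c]

-- | Render a node back to markup (simplified, see the note above).
renderNode :: Node -> String
renderNode (NodeText t)    = escapeXml t
renderNode (NodeElement e) =
  "<" ++ elementName e ++ concatMap renderAttr (elementAttrs e) ++ ">"
    ++ concatMap renderNode (elementChildren e)
    ++ "</" ++ elementName e ++ ">"
  where
    renderAttr (k, v) = " " ++ k ++ "=\"" ++ escapeXml v ++ "\""

-- | Flatten the children of an element back into raw HTML text.
innerHtml :: Element -> String
innerHtml = concatMap renderNode . elementChildren

-- | A content element whose HTML was not escaped, so it has a child element.
sample :: Element
sample =
  Element "content" [("type", "html")]
    [ NodeElement
        (Element "a" [("href", "https://example.org/")] [NodeText "[link]"])
    ]
```

Here directText sample is the empty string, while innerHtml sample gives <a href="https://example.org/">[link]</a>, which is the kind of flattened raw HTML the question above is asking about.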