matrix-org / matrix-hookshot

A bridge between Matrix and multiple project management services, such as GitHub, GitLab and JIRA.
https://matrix-org.github.io/matrix-hookshot/
Apache License 2.0
293 stars 68 forks source link

rss widget parsing <content> and <summary> with html in them as xml/xhtml #686

Closed rrix closed 1 year ago

rrix commented 1 year ago

@Half-Shot tried to resolve this in 61f25fae3611b753cb7ac07a1f2f41942780b83d but this unfurled in to more trouble.

I have a feed which does not parse:

Error fetching https://arcology.garden/updates.xml: Attribute without value
Line: 625
Column: 239
Char: >

That line is a youtube copy-pasta embed name of <iframe width="560" height="315" src="https://www.youtube.com/embed/V9vQ06n7HX4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> -- it doesn't like the allowfullscreen attribute

W3C feed validator says it's more-or-less valid and while i could probably coerce pandoc in to spitting out xhtml for the feeds, i'd rather not. Meanwhile, a close reading of RFC4287 s3.1.1.2 would reveal that this feed isn't actually valid and that I should be escaping the &lt; elements so they escape the parser.

I'll do that, maybe this weekend, but this feed worked fine with go-neb so it caught me a bit by surprise.

In general I think feed readers should behave more like browsers in trying to handle parsing and presentation errors with the goal of having a document to present even if it's not "valid"

cheers

tadzik commented 1 year ago

Yeah, my understanding is that any invalid XML in a feed should be either escaped as you say, or stuffed in a <![CDATA[]]> so that the parser skips over it.

Having tried your feed in the w3c validator it actually fails for all sorts of reasons: https://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Farcology.garden%2Fupdates.xml, starting at duplicate id attribute in some entries.

If something is good enough for the W3C validator but failing in hookshot I'd consider that a bug – but this is clearly invalid XML, RSS standards aside.

Half-Shot commented 1 year ago

Yeah, I think given we're seeing very low rates of parsing failures in hookshot (about 0.401%, well within the margin of error) I think we're likely relaxed enough about our parsing. I think if W3C starts saying your feed is valid @rrix and hookshot still complains, we can poke at it.

rrix commented 1 year ago

it's HTTP 200 Okay, not HTTP 200 Perfect, but okay lol

rrix commented 1 year ago

i'll just go back to running my own bot which parses the feeds just fine, but thanks for looking :)