otherjoel / splitflap

🔖⚛️ RSS / Atom feed generation library for Racket

Correctly validate elements that accept `string?` or `txexpr?` #4

Closed otherjoel closed 2 years ago

otherjoel commented 2 years ago

Text content in feed elements must not use named HTML entities (except for the five that are valid in XML). If the content contains HTML tags, the HTML must be escaped and (in the case of Atom) the appropriate type attribute set on the surrounding element. Further, values that will end up inside <![CDATA[...]]> must escape any occurrences of the string ]]>.

None of the above is currently checked.
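As a rough illustration of the last requirement, here is a minimal sketch of one common way to escape ]]> inside CDATA: end the section in the middle of the sequence and immediately open a new one. The name `cdata-wrap` is hypothetical and not part of splitflap.

```racket
#lang racket/base
(require racket/string)

;; Hypothetical helper for illustration; not part of splitflap's API.
;; Wraps a string in a CDATA section. A "]]>" inside the string would end
;; the section early, so each occurrence is split across two adjacent
;; sections: "]]" closes the first, and ">" opens the next.
(define (cdata-wrap str)
  (string-append "<![CDATA["
                 (string-replace str "]]>" "]]]]><![CDATA[>")
                 "]]>"))

(cdata-wrap "if (a[b[0]]>c) ...")
;; => "<![CDATA[if (a[b[0]]]]><![CDATA[>c) ...]]>"
```

A parser reading the result concatenates the two sections and recovers the original text unchanged.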

These are the elements that need better validation in this regard:

  1. feed-item-title and episode-title
  2. feed-name and podcast-name
  3. feed-item-content and episode-content

Each of these should accept xexpr? and be processed differently depending on whether the supplied value is a string or a tagged X-expression.
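A minimal sketch of what that dispatch could look like for an Atom content element, assuming that a plain string is treated as text and a tagged X-expression is serialized to HTML. The function names are hypothetical, and the choice of "text" for strings is an assumption rather than a decision made in this issue.

```racket
#lang racket/base
(require xml)

;; Hypothetical helper: pick the Atom type attribute and the text payload
;; based on whether the value is a string or a tagged X-expression.
(define (content->type+text v)
  (cond
    [(string? v) (values "text" v)]
    [(and (pair? v) (xexpr? v)) (values "html" (xexpr->string v))]
    [else (raise-argument-error 'content->type+text
                                "string or tagged X-expression"
                                v)]))

;; Hypothetical helper: build the surrounding Atom element as an X-expression.
;; The payload is a string, so the XML writer escapes it when serializing.
(define (atom-content v)
  (define-values (type text) (content->type+text v))
  `(content ([type ,type]) ,text))

(xexpr->string (atom-content "Fish & chips"))
;; => "<content type=\"text\">Fish &amp; chips</content>"
(xexpr->string (atom-content '(p "Fish & chips")))
;; => "<content type=\"html\">&lt;p&gt;Fish &amp;amp; chips&lt;/p&gt;</content>"
```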

LiberalArtist commented 2 years ago

I've looked into this a bit more and confirmed the impression I'd had when I wrote on the mailing list.

Some of Apple's documentation seems to be written for a person who doesn't understand XML at a deep level and is writing markup by hand. The parts about <![CDATA[ … ]]> seem to fit into that category: it is arguably relatively easy to write by hand, but it's strictly more fragile when generating XML programmatically, and it's never required. A <![CDATA[ … ]]> block is simply a type of concrete syntax: at the level of the XML Infoset, "there is a character information item for each data character that appears in the document, whether literally, as a character reference, or within a CDATA section." (§ 2.6 of the spec), and the information set does not include "whether characters are represented by character references" or "the boundaries of CDATA marked sections" (Appendix D).
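To make that concrete, here is a small sketch using Racket's xml library that writes the same character data two ways. The element name and sample text are made up for illustration. Note that the library's `cdata` struct expects its string to already include the <![CDATA[ and ]]> delimiters.

```racket
#lang racket/base
(require xml)

;; The same character data, "Tips & <b>tricks</b>", in two concrete syntaxes.

;; 1. As ordinary text: the writer escapes &, < and > for us.
(define as-references
  (xexpr->string '(description "Tips & <b>tricks</b>")))

;; 2. As a CDATA section: written verbatim, so the caller is responsible
;;    for making sure the content contains no "]]>".
(define as-cdata
  (xexpr->string
   `(description ,(cdata #f #f "<![CDATA[Tips & <b>tricks</b>]]>"))))

as-references
;; => "<description>Tips &amp; &lt;b&gt;tricks&lt;/b&gt;</description>"
as-cdata
;; => "<description><![CDATA[Tips & <b>tricks</b>]]></description>"
```

A conforming XML parser should report the same character data for both documents; only the first form gets its escaping for free from the writer.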

Specifically, it looks like elements like description should, at the RSS/Atom/XML level, contain character data. This character data should in turn correspond to a fragment of HTML, so the HTML is doubly encoded. This also means it's important to keep track of which XML elements are expected to contain serialized HTML and which are expected to contain plain text, without markup: when podcasts get this wrong, you end up seeing &amp; in your GUI rather than &.
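A small sketch of how that failure mode arises, assuming a title element that clients treat as plain text. The sample title is invented for illustration.

```racket
#lang racket/base
(require xml)

(define raw-title "Tea & Crumpets")

;; Correct for a plain-text element: the text is escaped exactly once,
;; by the XML writer.
(define escaped-once
  (xexpr->string `(title ,raw-title)))

;; Mistake: the text is first serialized as if it were HTML, then escaped
;; again by the XML writer. A client that treats <title> as plain text
;; decodes one level only and displays "Tea &amp; Crumpets".
(define escaped-twice
  (xexpr->string `(title ,(xexpr->string raw-title))))

escaped-once  ; => "<title>Tea &amp; Crumpets</title>"
escaped-twice ; => "<title>Tea &amp;amp; Crumpets</title>"
```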

For a concrete example, I looked at https://fossandcrafts.org/rss-feed.rss (hi @cwebber & @mlemmer!), which I happen to know looks right in Apple Podcasts. Here is a bit of the description for their episode 36 (I shortened this by hand for illustrative purposes):

<description>&lt;p&gt;Lightning round!  Morgan and Christine blast through a bunch of
snack-sized topics they're currently interested in, ranging from an
actual FOSS video game made for the NES, to &amp;quot;Free Soft Wear&amp;quot; clothing,
to compiler towers!&lt;/p&gt;&lt;p&gt;&lt;img src=&quot;https://mlemmer.org/images/IMG_20210821_120040scaled.jpg&quot; alt=&quot;Free Soft Wear tag&quot; /&gt;&lt;/p&gt;&lt;p&gt;&lt;em&gt;above image from &lt;a href=&quot;https://mlemmer.org/blog/free_soft_wear/&quot;&gt;Morgan's blogpost on &amp;quot;free soft wear&amp;quot;&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;</description>

This might be generated by something like:

#lang racket
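(require xml)
;; Serialize each HTML fragment to a string, then embed the joined string as
;; the text of <description>; write-xexpr escapes it a second time.
(write-xexpr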
 `(description
   ,(string-append*
     (map xexpr->string
          `[(p "Lightning round!  Morgan and Christine blast through a bunch of\n"
               "snack-sized topics they're currently interested in, ranging from an\n"
               "actual FOSS video game made for the NES, to \"Free Soft Wear\" clothing,\n"
               "to compiler towers!")
            (p (img ([src "https://mlemmer.org/images/IMG_20210821_120040scaled.jpg"]
                     [alt "Free Soft Wear tag"])
                    " "))
            (p (em "above image from "
                   (a ([href "https://mlemmer.org/blog/free_soft_wear/"])
                      "Morgan's blogpost on \"free soft wear\"")))
            (p (strong "Links:"))]))))

The only difference in the output (the Foss & Crafts feed actually is generated with Guile and Haunt) is that they replace " with &quot;. I'm not sure whether Apple's requirements really mean to forbid " from syntactically appearing in character data (other than in attribute values) or simply mean that you must not use the HTML entity &ldquo;. I suspect it's the latter, because:

  1. &ldquo; is defined in HTML as U+201C LEFT DOUBLE QUOTATION MARK (i.e. “, a curly quote), which is a totally different character than the built-in XML entity &quot;, U+0022 QUOTATION MARK (i.e. ", the straight ASCII quote).
  2. Apple uses libxml2 pretty pervasively for XML parsing, so it would take specific effort on their part to enforce extra requirements on concrete syntax. (This is the less persuasive point, because Apple certainly is capable of imposing silly, arbitrary extra requirements if they decide they want to.)

According to the WHATWG spec (which links to this more detailed rationale), HTML entities (or "named character references", in WHATWG parlance) are basically a legacy compatibility mechanism and won't be expanded on further—you can always use a numeric character reference or just write Unicode—so restricting them isn't a great loss.

LiberalArtist commented 2 years ago

Oh, I just remembered that I'd observed in https://github.com/racket/racket/issues/2440 that iTunes would generate XML plist files with e.g. <string>Gospel &#38; Religious</string> (rather than &amp;, as https://podcasters.apple.com/support/823-podcast-requirements instructs), which seems like further evidence that their requirement is really "produce valid XML", not the idiosyncratic list of things that webpage implies they require.

otherjoel commented 2 years ago

Yeah, all of this jibes.

> This also means it's important to keep track of which XML elements are expected to contain serialized HTML and which are expected to contain plain text, without markup

The Atom spec is both more flexible and extremely explicit about this kind of thing, while RSS is flaky and vague, so RSS will probably dictate the options to some extent.

RSS says <description> can contain HTML if it is escaped, like we see in the feed you linked.

For feed-level <title> tags, the RSS spec says only “If you have an HTML website that contains the same information as your RSS file, the title of your channel should be the same as the title of your website.” That seems to indicate that the title should match the contents of the <title> tag on your website, meaning no HTML.

For item-level <title> tags, the RSS spec seems to be completely silent on whether escaped HTML is allowed.

otherjoel commented 2 years ago

Where does this leave us? It seems clear that we can just leave Unicode characters like © etc. alone. And I think named character references (outside of the “XML Five”) should be either replaced with their numeric equivalents, or neutered by escaping the initial ampersand.
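Here is a rough sketch of those two options combined: known named references are replaced with numeric ones, and unknown ones are neutered. The function name and the tiny entity table are hypothetical; a real implementation would need the full HTML list of named character references.

```racket
#lang racket/base

;; The five entities that are predefined in XML; these are left alone.
(define xml-five '("amp" "lt" "gt" "quot" "apos"))

;; Illustrative subset only: maps HTML entity names to code points.
(define html-entities
  (hash "ldquo" #x201C "rdquo" #x201D "copy" #xA9 "nbsp" #xA0))

;; Hypothetical helper: rewrite named character references in a string.
(define (fix-named-entities str)
  (regexp-replace*
   #rx"&([A-Za-z][A-Za-z0-9]*);" str
   (lambda (whole name)
     (cond
       ;; Valid in XML as-is.
       [(member name xml-five) whole]
       ;; Known HTML entity: use the decimal numeric reference instead.
       [(hash-ref html-entities name #f)
        => (lambda (code-point) (format "&#~a;" code-point))]
       ;; Unknown name: escape the leading ampersand so it stays literal.
       [else (string-append "&amp;" (substring whole 1))]))))

(fix-named-entities "&ldquo;Hi&rdquo; &amp; &copy; &bogus; ©")
;; => "&#8220;Hi&#8221; &amp; &#169; &amp;bogus; ©"
```

Literal Unicode characters such as © pass through untouched, and bare ampersands that are not part of a reference are left for the usual XML escaping step.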

Here’s one possible strategy:

Titles: Allow only strings (not tagged X-expressions).

Content: Allow any X-expression.

I do want to leave users the option of CDATA, for convenience. I don’t see any practical downside to it.

Since the escaping approach is going to be uniform across both RSS and Atom, I’ll probably move the escaping so it happens during validation, at the point where the values are supplied, not at the point when the XML string is generated.
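One way this could look, as a sketch only: a struct guard that validates and normalizes values when they are supplied, so the generation step can treat every field as a plain string. The `entry` struct and its fields are hypothetical and do not reflect splitflap's actual constructors.

```racket
#lang racket/base
(require xml)

;; Hypothetical struct for illustration. The guard runs at construction
;; time: titles must be strings, content may be any X-expression, and
;; tagged content is serialized to an HTML string immediately.
(struct entry (title content)
  #:guard
  (lambda (title content struct-name)
    (unless (string? title)
      (raise-argument-error struct-name "string?" title))
    (unless (xexpr? content)
      (raise-argument-error struct-name "xexpr?" content))
    (values title
            (if (string? content)
                content
                (xexpr->string content)))))

(entry-content (entry "Episode 1" '(p "Fish & chips")))
;; => "<p>Fish &amp; chips</p>"
```

Because the stored content is already a string, the same value can be embedded in either an RSS or an Atom element and escaped once more by the XML writer.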