snoyberg / xml

Various XML utility packages for Haskell

Cannot pass raw ampersand '&' to parseText #182

Closed xave closed 1 year ago

xave commented 1 year ago

The objective is to give parseText a string "Apples & Oranges" without escaping the ampersand.

Using

    ...parseText def { psDecodeIllegalCharacters = passAmpersand }
      where
        passAmpersand :: Int -> Maybe Char
        passAmpersand = \case
          38 -> Just '&'
          _  -> Nothing

does not work because the ampersand is not an illegal character, so psDecodeIllegalCharacters is never invoked.

A potential workaround is to pre-process my string "Apples & Oranges" and replace the ampersand with a character reference outside the legal range, of the form &#[0-9]+; described in the docs for psDecodeIllegalCharacters.

  1. I am unsure of how to represent &#[0-9]+; as an Int, where the output would be turned into Just '&'
  2. It would be ideal to just say raw ampersand is fine instead of (1).

k0ral commented 1 year ago

Unless I am mistaken, parseText is designed to consume valid XML data, which the string Apples & Oranges is not. See specification:

The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.

It seems to me that what you want is a pre-processing function that escapes & (and other special characters) before feeding the string to parseText. Example implementations (not type-checked, not tested):
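
A minimal sketch of such a pre-processing step, offered as an illustration rather than the commenter's own code. The name `escapeXmlText` is hypothetical, and the lazy `Text` type is assumed to match what `parseText` accepts. Only the two characters named in the specification excerpt above are escaped.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text.Lazy as TL

-- | Escape the characters that must not appear literally in XML
-- character data. Hypothetical helper, not part of xml-conduit.
escapeXmlText :: TL.Text -> TL.Text
escapeXmlText = TL.concatMap escapeChar
  where
    escapeChar :: Char -> TL.Text
    escapeChar '&' = "&amp;"          -- raw ampersand becomes an entity reference
    escapeChar '<' = "&lt;"           -- raw left angle bracket likewise
    escapeChar c   = TL.singleton c   -- every other character passes through
```

With this, `escapeXmlText "Apples & Oranges"` yields `"Apples &amp; Oranges"`. The function is meant for character data only: applied to a whole document it would also escape the markup delimiters, and applied to text that is already escaped it would escape it a second time.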

Does that fulfill your need?

xave commented 1 year ago

It does. Thank you very much!