pharo-contributions / XML-XMLParser

XML Parser for Pharo
MIT License

CRLF line ending is normalized in CDATA contents #9

Closed by jvalteren 2 years ago

jvalteren commented 2 years ago

When parsing text contained in a <![CDATA[ ... ]]> section, the parser normalizes the line endings of the CDATA contents. I would expect CDATA contents (character data) to be left alone and not parsed or modified. Unfortunately, the parser does not seem to honor this aspect of CDATA sections.

I am using XMLDOMParser to parse CalDAV 'calendar-query' responses containing iCalendar data. The iCalendar specification requires CRLF line endings for this content (see https://datatracker.ietf.org/doc/html/rfc5545#section-3.1). As it turns out, the current implementation of XMLWellFormedParserTokenizer and XMLNestedStreamReader does the opposite: it converts CRLF to LF.

The problematic code is in XMLWellFormedParserTokenizer>>#nextCDataSection, which sends #next to 'streamReader', an instance of XMLNestedStreamReader. Its implementation of #next normalizes CRLF line endings to LF, which is incorrect in this scenario.
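
For readers who do not have the Pharo sources at hand, the kind of normalization described above can be sketched in Python. This is only an illustration of the line-ending rule in XML 1.0 section 2.11 (CRLF and lone CR both become LF); it is not the actual code of XMLNestedStreamReader, and the function name `normalize_line_endings` is made up for this example.

```python
def normalize_line_endings(text: str) -> str:
    """Translate CRLF pairs and lone CR characters to a single LF."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\r":
            # Skip the LF of a CRLF pair so the pair yields exactly one LF.
            if i + 1 < len(text) and text[i + 1] == "\n":
                i += 1
            out.append("\n")
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```

A reader that applies this rule to every character it hands out cannot tell whether the tokenizer is currently inside a CDATA section, which is consistent with the behavior reported here.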

jvalteren commented 2 years ago

I have devised a workaround, but I'm not sure if it is semantically correct. I'll create a pull request soon.

jvalteren commented 2 years ago

While looking into creating a test for this scenario, I came across the conformance tests that are already included. One specific test (valid-sa-116, see: https://www.w3.org/XML/Test/xmlconf-20020606.htm) covers this scenario, and it fails with my workaround applied.

So... what to do?! The conformance test appears to indicate that the normalization of CRLF in CDATA sections is desired behavior. But for parsing iCalendar responses as part of a CalDAV query (i.e. embedded in an XML document), this does not work.
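
As a cross-check outside Pharo, the same behavior can be observed with another XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` (backed by expat); the helper name `cdata_text` and the sample document are assumptions made for this illustration, not part of the CalDAV exchange discussed here.

```python
import xml.etree.ElementTree as ET


def cdata_text(xml_bytes: bytes) -> str:
    """Parse the document and return the text content of its root element."""
    return ET.fromstring(xml_bytes).text


# iCalendar-like payload with CRLF line endings inside a CDATA section.
doc = b"<data><![CDATA[BEGIN:VCALENDAR\r\nEND:VCALENDAR\r\n]]></data>"

# repr() makes the line-ending characters visible.
print(repr(cdata_text(doc)))
```

With this parser the CR characters should be gone from the printed text, which suggests that normalizing line endings inside CDATA sections is common parser behavior rather than something specific to XMLParser.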

jvalteren commented 2 years ago

Haha, I feel like I'm talking to myself here :-)

TL;DR Coming full circle :-) The CalDAV specification actually mentions the normalization of CRLF (see: https://datatracker.ietf.org/doc/html/rfc4791#section-9.6):

Given that XML parsers normalize the two-character sequence CRLF (US-ASCII decimal 13 and US-ASCII decimal 10) to a single LF character (US-ASCII decimal 10), the CR character (US-ASCII decimal 13) MAY be omitted in calendar object resources specified in the CALDAV:calendar-data XML element.

So, the issue isn't with the XML parser ;-) I will instead try to make the iCalendar implementation accept line endings with the CR character omitted.
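
If the iCalendar implementation cannot be changed, one possible alternative is to restore the CRLF line endings on the extracted text before handing it over. A minimal Python sketch, assuming the calendar data is already available as a string; the function name `restore_crlf` is hypothetical:

```python
import re


def restore_crlf(ical_text: str) -> str:
    """Return the text with every line ending written as CRLF."""
    # Match an LF with an optional preceding CR, so that line endings
    # which are already CRLF are not turned into CR CR LF.
    return re.sub(r"\r?\n", "\r\n", ical_text)
```

Because existing CRLF pairs are matched as a whole, applying the function twice gives the same result as applying it once.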

Consider this issue solved.