Ignore linear whitespace in data and chunk element #PCDATA (#1003).

skynavga commented 5 years ago

Closes #1003.

nigelmegitt commented 5 years ago

@skynavga why the change of approach relative to https://github.com/w3c/ttml2/issues/1003#issuecomment-423246930 ?

skynavga commented 5 years ago

@nigelmegitt after reviewing http://www.rfc-editor.org/rfc/rfc4648.html#section-3.3, I determined that it was my original intent when drafting this functionality that only LWSP characters be ignored, which matches the current implementation of TTV; I don't see a good use case for permitting (but ignoring) non-alphabet characters that are not LWSP, and, indeed, there could be negative security implications of doing so, e.g., allowing but ignoring Unicode non-spacing characters, etc.

nigelmegitt commented 5 years ago

@skynavga I'm still a bit confused here. This means that after LWSP is ignored, the data can still contain characters prohibited by base64. Wouldn't that make it invalid? Looking at that section from RFC4648:

Implementations MUST reject the encoded data if it contains characters outside the base alphabet when interpreting base-encoded data, unless the specification referring to this document explicitly states otherwise. Such specifications may instead state, as MIME does, that characters outside the base encoding alphabet should simply be ignored when interpreting data ("be liberal in what you accept"). Note that this means that any adjacent carriage return/ line feed (CRLF) characters constitute "non-alphabet characters" and are ignored. Furthermore, such specifications MAY ignore the pad character, "=", treating it as non-alphabet data, if it is present before the end of the encoded data. If more than the allowed number of pad characters is found at the end of the string (e.g., a base 64 string terminated with "==="), the excess pad characters MAY also be ignored.

It looks like our reasonable choices are:

Specify characters (LWSP) to ignore, i.e. strip out before decoding, and that all remaining non-alphabet characters are prohibited (validating processor must reject); or
Do as is suggested for MIME, and require that all non-alphabet characters are ignored and stripped prior to decoding.

Otherwise are we not explicitly allowing the undesirable situation described in the same section of the RFC?:

Non-alphabet characters may be exploited as a "covert channel", where non-protocol data can be sent for nefarious purposes. Non-alphabet characters might also be sent in order to exploit implementation errors leading to, e.g., buffer overflow attacks.

skynavga commented 5 years ago

@nigelmegitt and I am proposing (in this PR) that we go with option #1, i.e., a validating processor may reject a document that contains non-LWSP, non-alphabet characters, which is precisely the behavior of TTV and TTPE at this time;

nigelmegitt commented 5 years ago

@skynavga that makes sense. I can't see anything that requires that the encoded data must be valid according to the specified encoding scheme, after stripping the LWSP - maybe there is some wording there that I have missed? If it isn't there, we should add it to this pull request, would you agree?

skynavga commented 5 years ago

@nigelmegitt I agree it is not clear whether the data must match the encoding, though one might infer it in the case of chunk which says:

If an encoding attribute is specified, then it must denote the actual encoding of the byte sequence represented by the chunk element. If no encoding attribute is specified, then the encoding must be considered to be base64.

however, the language under data is somewhat different, saying

If simple data embedding is used, i.e., the content of the data element is one or more text nodes, then an encoding attribute may be specified, and, if not specified, must be considered to be base64.

What we seem to be missing is language that says the encoded bytes must adhere to the specified or defaulted value (as the case may be) of the encoding attribute.

I will add language to address this.

nigelmegitt commented 5 years ago

Status update on this:

As per https://github.com/w3c/ttml2/pull/1131#issuecomment-511863528 @skynavga intends to add some further language
The tests at w3c/ttml2-tests#212 don't yet include any LWSP to demonstrate that the otherwise-valid embedded data remains valid even if LWSP is present.

skynavga commented 5 years ago

@nigelmegitt the following is incorrect

The tests at w3c/ttml2-tests#212 don't yet include any LWSP to demonstrate that the otherwise-valid embedded data remains valid even if LWSP is present.

see https://github.com/w3c/ttml2-tests/pull/212#issuecomment-516875934

nigelmegitt commented 5 years ago

@skynavga apologies for that, thanks for putting me right. I've approved the tests pull request now.

skynavga commented 5 years ago

@nigelmegitt perhaps you can approve this PR now as well?

nigelmegitt commented 5 years ago

I'm happy with the normative parts of this, but reflecting on my experience looking at the tests, it would be helpful to add an informative note around LWSP that, since numerical character references are resolved before processing the XML, any LWSP encoded as numerical character references are removed too. Would you be happy to add a note to that effect @skynavga ?

skynavga commented 5 years ago

@nigelmegitt not really; you are, in effect, asking me to paraphrase an extremely small subset of the meaning of "instantiate" as appears in §3.2.1

The processor provides at least one mechanism for notionally instantiating a reduced xml infoset representation of a conformant document instance.

it is not our task to explain how the concrete representation of an XML document is parsed, and doing what you ask is starting down that slippery slope

nigelmegitt commented 5 years ago

OK

skynavga commented 5 years ago

@nigelmegitt thanks

w3c / ttml2

Ignore linear whitespace in data and chunk element #PCDATA (#1003). #1131