Make preprocessing of input stream handle supplementary characters

Update: I now think we should do https://github.com/w3c/css-validator/pull/386 rather than making this change.

Fixes https://github.com/w3c/css-validator/issues/383. When performing preprocessing of the input stream as specified in https://drafts.csswg.org/css-syntax/#input-preprocessing, this change makes our implementation handle non-BMP supplementary characters as expected — by only replacing surrogates with U+FFFD if they are lone (unpaired) surrogates, but not replacing surrogates that are part of surrogate pairs (a high surrogate followed by a low surrogate).

Otherwise, without this change, a parse error will occur when our implementation encounters supplementary characters in the input stream.

I’ve tested this both in the context of the CSS validator itself running standalone and in the context of the HTML checker and found that it works as expected as far as replacing not replacing surrogates that are part of surrogate pairs.

However, I’ve not found a way to test that this code actually replaces lone (unpaired) surrogates as expected — because in the case of most encodings I tried testing with, Java’s internal encoding handling replaces the lone surrogates with U+FFFD before our input-stream-preprocessing implementation is ever run.

So I don’t know of any way to have lone surrogates passed through as-is from an input stream in such a way that our input-stream-preprocessing code would have a chance to replace them. Java always replaces them before our code runs.

w3c / css-validator

Make preprocessing of input stream handle supplementary characters #385