w3c / css-validator

W3C CSS Validation Service
https://jigsaw.w3.org/css-validator/

Make preprocessing of input stream handle supplementary characters #385

Closed sideshowbarker closed 1 year ago

sideshowbarker commented 1 year ago

Update: I now think we should do https://github.com/w3c/css-validator/pull/386 rather than making this change.

Fixes https://github.com/w3c/css-validator/issues/383. When preprocessing the input stream as specified in https://drafts.csswg.org/css-syntax/#input-preprocessing, this change makes our implementation handle non-BMP supplementary characters as expected: it replaces a surrogate with U+FFFD only if it is a lone (unpaired) surrogate, and leaves untouched any surrogates that are part of a surrogate pair (a high surrogate followed by a low surrogate).

Without this change, a parse error occurs when our implementation encounters supplementary characters in the input stream.
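For illustration, the pair-aware replacement described above could look roughly like the following. This is a minimal sketch, not the code from this pull request: the class name `SurrogatePreprocessing` and the method `replaceLoneSurrogates` are hypothetical, and it works on an in-memory `String` rather than on the validator's actual input stream. It also covers only the surrogate step, not the other preprocessing steps in the spec.

```java
// Hypothetical helper, not taken from the css-validator sources.
final class SurrogatePreprocessing {
    private static final char REPLACEMENT = '\uFFFD';

    // Replaces lone (unpaired) surrogates with the replacement character
    // and copies well-formed surrogate pairs through unchanged.
    static String replaceLoneSurrogates(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < input.length()
                    && Character.isLowSurrogate(input.charAt(i + 1))) {
                // High surrogate followed by a low surrogate: keep both.
                out.append(c).append(input.charAt(i + 1));
                i++; // skip the low surrogate that was just copied
            } else if (Character.isSurrogate(c)) {
                // Any other surrogate is unpaired.
                out.append(REPLACEMENT);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A supplementary character encoded as a surrogate pair.
        String paired = "a\uD83D\uDE00" + "b";
        // The same high surrogate with no low surrogate after it.
        String lone = "a\uD83D" + "b";
        System.out.println(replaceLoneSurrogates(paired).equals(paired));
        System.out.println(replaceLoneSurrogates(lone).equals("a\uFFFD" + "b"));
    }
}
```

A low surrogate that appears before a high surrogate is treated as two lone surrogates here, since only the high-then-low order forms a pair.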

I’ve tested this both in the context of the CSS validator itself running standalone and in the context of the HTML checker, and found that it works as expected as far as not replacing surrogates that are part of surrogate pairs.

However, I’ve not found a way to test that this code actually replaces lone (unpaired) surrogates as expected, because with most of the encodings I tried, Java’s internal encoding handling replaces the lone surrogates with U+FFFD before our input-stream-preprocessing implementation is ever run.

So I don’t know of any way to have lone surrogates passed through as-is from an input stream in such a way that our input-stream-preprocessing code would have a chance to replace them. Java always replaces them before our code runs.
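The behavior described above can be reproduced outside the validator. The sketch below is an assumption-laden illustration: the class `LoneSurrogateDecodingDemo` and its helpers are hypothetical, and it uses `new String(bytes, StandardCharsets.UTF_8)` rather than whatever reader the validator actually uses. It feeds the decoder the three bytes that would encode U+D800 if UTF-8 permitted surrogates, and checks that no surrogate survives decoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the css-validator sources.
final class LoneSurrogateDecodingDemo {
    // Decodes bytes as UTF-8; this String constructor is documented to
    // replace malformed input with the charset's replacement string.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns true if any UTF-16 code unit in s is a surrogate.
    static boolean containsSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 'a', then ED A0 80 (a surrogate code point written in the
        // three-byte UTF-8 pattern, which is malformed), then 'b'.
        byte[] bytes = { 'a', (byte) 0xED, (byte) 0xA0, (byte) 0x80, 'b' };
        String decoded = decodeUtf8(bytes);
        System.out.println("surrogate survived: " + containsSurrogate(decoded));
        System.out.println("has U+FFFD: " + (decoded.indexOf('\uFFFD') >= 0));
    }
}
```

How many U+FFFD characters the decoder emits for the malformed sequence may vary, so the sketch only checks that the surrogate is gone and at least one replacement character is present.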

ylafon commented 1 year ago

Used #386 instead as it is really the proper way (for now)