Fix handling of non-matching surrogates in collation data.

unicode-org / conformance

Unicode & CLDR Data Driven Testing

https://unicode-org.github.io/conformance/

Fix handling of non-matching surrogates in collation data. #147

Open sven-oly opened 9 months ago

sven-oly commented 9 months ago

The current test generator doesn't create collation tests when either of the test strings contains an unpaired surrogate. These cases are recorded in the log files, but they are not stored in any test data or shown in any dashboard.
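As a rough illustration of the filtering described above, here is a minimal Python sketch. It is not the repository's generator code: the function names (`parse_code_points`, `has_surrogate`, `partition_lines`) are hypothetical, and it assumes each test line is a run of space-separated hex code points, optionally followed by `;` and a `#` comment. Instead of only logging the skipped lines, it returns them so they could be counted or reported.

```python
def parse_code_points(field: str) -> list[int]:
    """Parse space-separated hex code points, e.g. 'D800 0021'."""
    return [int(token, 16) for token in field.split()]


def has_surrogate(code_points: list[int]) -> bool:
    """True if any code point lies in the surrogate range U+D800..U+DFFF."""
    return any(0xD800 <= cp <= 0xDFFF for cp in code_points)


def partition_lines(lines):
    """Split test lines into (kept, skipped) lists of code point sequences.

    Assumed line format (hypothetical helper, not the project's parser):
    hex code points, then an optional ';' and an optional '#' comment.
    """
    kept, skipped = [], []
    for line in lines:
        # Drop the trailing comment, then anything after the ';' separator.
        data = line.split("#", 1)[0].split(";", 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        code_points = parse_code_points(data)
        if has_surrogate(code_points):
            skipped.append(code_points)
        else:
            kept.append(code_points)
    return kept, skipped
```

Returning the skipped sequences alongside the kept ones would make it possible to surface a count of filtered cases in test data or a dashboard, rather than leaving them only in logs.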

sffc commented 3 months ago

@markusicu How important is it to test unpaired surrogate collation behavior?

markusicu commented 3 months ago

https://www.unicode.org/Public/UCA/latest/CollationTest.html

“These files contain test cases that include ill-formed strings, with surrogate code points. Implementations that do not weight surrogate code points the same way as reserved code points may filter out such lines in the test cases, before testing for conformance.”

sffc commented 2 weeks ago

A key problem here is that unpaired surrogates cannot be represented in UTF-8 (they can be in WTF-8). I feel like I'm not super interested in testing this corner of the conformance data for collation and we should just limit our testing to things that are valid in UTF-8.
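To make the encoding point concrete, here is a small Python sketch (the helper names are illustrative only). Python's strict `utf-8` codec rejects a lone surrogate, while the standard `surrogatepass` error handler emits the three-byte generalized form, which for a lone surrogate matches what WTF-8 would produce. Note that `surrogatepass` is not a full WTF-8 implementation: it does not merge an adjacent high/low pair into one supplementary code point.

```python
def encodes_as_strict_utf8(text: str) -> bool:
    """True if text can be encoded as well-formed UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def encode_with_surrogatepass(text: str) -> bytes:
    """Encode text, letting lone surrogates through as 3-byte sequences."""
    return text.encode("utf-8", errors="surrogatepass")


lone_high = "\ud800"  # an unpaired high surrogate

print(encodes_as_strict_utf8(lone_high))      # False
print(encode_with_surrogatepass(lone_high))   # b'\xed\xa0\x80'
```

Any executor that receives test data over a strictly UTF-8 channel (for example, standard JSON text) would therefore need a separate escaping or byte-level convention to carry such strings.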

markusicu commented 2 weeks ago

That's fine. Did you see my reply from jun05?

sffc commented 2 weeks ago

> That's fine. Did you see my reply from jun05?

Yes I did, and it seems like this is the current behavior.

But, the conformance data contains unpaired surrogates presumably because in environments that support them, they need to have a certain behavior, right? So it seems like unicode-org/conformance should pass them down to executors that represent implementations that handle them.

So, I propose keeping this issue open, but demoting the priority.