w3c / xslt30-test


Unicode normalization of test files #13

Closed michaelhkay closed 4 years ago

michaelhkay commented 4 years ago

I believe that some recent commits may have changed the Unicode normalization form of selected characters, causing tests to fail that previously passed. For example, Saxon 9.8 fails test output-0141 when run against the current GitHub version of the test but succeeds against the W3C version, even though there has been no visible or documented change to the test case. Close inspection suggests that the string http://iri.example.org/ﭏ/årsrapport/år/2005?x=y is represented by a different sequence of codepoints in the two repositories.

The test appears to intend the two occurrences of "år" to be different Unicode codepoint sequences: one is URI-escaped as %C3%A5r, the other as a%CC%8Ar. In codepoint terms, the first is (229, 114), that is, "small letter a with ring above" followed by "r" (composed form); the second is (97, 778, 114), that is, "a" followed by "combining ring above" followed by "r" (decomposed form). In the serialization assertions in the test-set file, the composed form seems to have been unintentionally replaced by the decomposed form.
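To make the difference concrete, here is a minimal Python sketch (not part of the test suite; the variable names are purely illustrative) that builds both forms of "år", lists their codepoints, and percent-encodes them as UTF-8. It assumes the standard library's `urllib.parse.quote` as a stand-in for the URI escaping that the test exercises.

```python
import unicodedata
from urllib.parse import quote

# Illustrative names, not taken from the test suite.
composed = "\u00e5r"      # U+00E5 (a with ring above) followed by "r"
decomposed = "a\u030ar"   # "a", U+030A (combining ring above), "r"

composed_cps = [ord(c) for c in composed]      # [229, 114]
decomposed_cps = [ord(c) for c in decomposed]  # [97, 778, 114]

# Percent-encoding works on the UTF-8 bytes, so the two forms escape differently.
composed_escaped = quote(composed)      # "%C3%A5r"
decomposed_escaped = quote(decomposed)  # "a%CC%8Ar"

# Normalizing either form makes the distinction disappear.
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)
nfd_of_composed = unicodedata.normalize("NFD", composed)

print(composed_cps, composed_escaped)
print(decomposed_cps, decomposed_escaped)
print(nfc_of_decomposed == composed, nfd_of_composed == decomposed)
```

Any tool that applies NFC or NFD to the file collapses the two strings into one, which is consistent with the failure described above.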

Rather than attempting to restore the previous state, it would be better to ensure that the situation does not arise again, by using numeric character references for the relevant characters. This is easier to achieve in the test case itself than in the metadata test assertions, which typically use CDATA sections, where character references are not recognized.
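The following Python sketch illustrates why numeric character references are robust here. The XML fragments are hypothetical and are not taken from the test suite. It assumes a tool that applies NFC to the raw file text, which is only one plausible cause of the change.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical fragments, for illustration only.
literal_src = "<p>\u00e5r/a\u030ar</p>"       # both forms as literal characters
charref_src = "<p>&#229;r/a&#778;r</p>"       # both forms as character references
cdata_src = "<p><![CDATA[a&#778;r]]></p>"     # a reference inside a CDATA section

def codepoints_after_nfc(xml_text):
    """Simulate a tool that NFC-normalizes the file, then parse the result."""
    normalized = unicodedata.normalize("NFC", xml_text)
    return [ord(c) for c in ET.fromstring(normalized).text]

# Literal characters: the decomposed form is silently composed.
literal_cps = codepoints_after_nfc(literal_src)

# Character references are plain ASCII in the file, so NFC leaves them alone
# and the parser still produces two distinct codepoint sequences.
charref_cps = codepoints_after_nfc(charref_src)

# Inside CDATA the reference is not expanded; it stays as literal text.
cdata_text = ET.fromstring(cdata_src).text

print(literal_cps)
print(charref_cps)
print(cdata_text)
```

The last case shows why assertions written in CDATA sections are harder to protect this way: they would first need to be rewritten as ordinary escaped text.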

michaelhkay commented 4 years ago

Fixed this by using numeric character references where appropriate.