READMEs in the Java, Objective-C and Ruby libraries incorrectly state that `displayRange` and `validRange` are "pairs of unicode code point indices".
However, the actual implementations and the conformance test suite suggest that they are UTF-16 code unit indices.
Each emoji in the test text consists of a single Unicode code point, so the code point length of the text is 5.
On the other hand, since each emoji is represented by a surrogate pair in UTF-16 encoding, the text's length in UTF-16 code units is 10.
This implies that the test case expects the parser to return UTF-16 ranges.
Furthermore, this JavaScript code calculates the `displayRangeEnd` using the `String.length` property, which, by the specification, counts UTF-16 code units.
I think either the documentation or the parser API should be fixed for consistency.