twitter / twitter-text

Twitter Text Libraries. This code is used at Twitter to tokenize and parse text to meet the expectations for what can be used on the platform.
https://developer.twitter.com/en/docs/counting-characters
Apache License 2.0

Wrong documentation of displayRange and validRange #286

Open swen128 opened 5 years ago

swen128 commented 5 years ago

The READMEs of the Java, Objective-C and Ruby libraries incorrectly state that displayRange and validRange are "pairs of unicode code point indices". However, the actual implementations and the conformance test suite suggest that they are UTF-16 code unit indices.

One example can be found here:

text: "😷👾😡🔥💩"
expected:
    displayRangeStart: 0
    displayRangeEnd: 9
    validRangeStart: 0
    validRangeEnd: 9

Each emoji in the text is a single Unicode code point, so the text is 5 code points long. However, each emoji is encoded as a surrogate pair in UTF-16, so the text is 10 UTF-16 code units long. An end index of 9 is the index of the last code unit (a code point range would end at 4), which implies that the test case expects the parser to return UTF-16 ranges.
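The arithmetic can be checked with a minimal sketch. It assumes a string of five emoji outside the Basic Multilingual Plane, written as code point escapes so the snippet survives encoding problems; the variable names are illustrative and not part of twitter-text.

```javascript
// Five emoji, each a single code point above U+FFFF (assumed to match the test case).
const text = "\u{1F637}\u{1F47E}\u{1F621}\u{1F525}\u{1F4A9}";

// String length counts UTF-16 code units, so each emoji contributes 2.
const utf16Length = text.length;

// The string iterator yields whole code points, so spreading counts code points.
const codePointLength = [...text].length;

// Inclusive end index of a range that covers the whole string, in each unit.
const utf16RangeEnd = utf16Length - 1;
const codePointRangeEnd = codePointLength - 1;

console.log({ utf16Length, codePointLength, utf16RangeEnd, codePointRangeEnd });
```

Only the UTF-16 end index agrees with the expected displayRangeEnd and validRangeEnd of 9.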

Furthermore, this JavaScript code calculates displayRangeEnd using the String.length property, which, per the ECMAScript specification, counts UTF-16 code units.
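For callers who need code point indices today, a conversion along these lines might work. This is only a sketch: `utf16IndexToCodePointIndex` is a hypothetical helper, not a twitter-text API, and it assumes the input is well-formed UTF-16.

```javascript
// Hypothetical helper (not part of twitter-text): returns the index of the
// code point that contains the UTF-16 code unit at `utf16Index`.
// Returns -1 for an empty string; an index past the end maps to the last code point.
function utf16IndexToCodePointIndex(text, utf16Index) {
  let codePointIndex = -1;
  let i = 0;
  while (i <= utf16Index && i < text.length) {
    codePointIndex++;
    // Code points above U+FFFF occupy two code units (a surrogate pair).
    i += text.codePointAt(i) > 0xffff ? 2 : 1;
  }
  return codePointIndex;
}

// Five emoji outside the Basic Multilingual Plane, as in the earlier example.
const emojiText = "\u{1F637}\u{1F47E}\u{1F621}\u{1F525}\u{1F4A9}";
console.log(utf16IndexToCodePointIndex(emojiText, 0)); // start of the range
console.log(utf16IndexToCodePointIndex(emojiText, 9)); // end of the range
```

With this mapping, the UTF-16 range 0 to 9 from the test case corresponds to the code point range 0 to 4.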

I think either the documentation or the parser API should be fixed for consistency.