twitter / twitter-text

Twitter Text Libraries. This code is used at Twitter to tokenize and parse text to meet the expectations for what can be used on the platform.
https://developer.twitter.com/en/docs/counting-characters
Apache License 2.0

Properly handle other languages for weightedLength #267

Open Manishearth opened 5 years ago

Manishearth commented 5 years ago

weightedLength is pretty naïve in the set of unicode ranges it uses, relegating large swaths of languages/scripts to double weighting for no evident reason.

Expected behavior

Languages like Khmer will have a per-codepoint weight of 100 when computing weightedLength. In other words, ក counts as 1 character when counting up to 280, and

កកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកក

is a valid tweet

Actual behavior

They have a per-codepoint weight of 200. ក counts as two characters, and the following tweet fills up the counter:

កកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកកក

The weightedLength API was introduced to handle counting characters for the 280 character limit, where Twitter's research indicated that CJK languages were "denser". This is a pretty valid metric to go off of; however, the config does not implement this metric.
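A minimal Python sketch of the counting rule, to make the numbers above concrete. The range table, `SCALE`, and `DEFAULT_WEIGHT` are assumptions recalled from the library's v2 config rather than quoted from it, and the helper names `weight_of` and `weighted_length` are hypothetical. It ignores the URL and emoji handling the real library does.

```python
import unicodedata

# ASSUMPTION: values recalled from the v2 config, not copied from the repo.
SCALE = 100
MAX_WEIGHTED_LENGTH = 280
DEFAULT_WEIGHT = 200
RANGES = [
    (0x0000, 0x10FF, 100),  # everything up to just before Hangul Jamo (U+1100)
    (0x2000, 0x200D, 100),
    (0x2010, 0x201F, 100),
    (0x2032, 0x2037, 100),
]


def weight_of(code_point):
    """Return the weight of one code point; hypothetical helper."""
    for start, end, weight in RANGES:
        if start <= code_point <= end:
            return weight
    return DEFAULT_WEIGHT


def weighted_length(text):
    """Sum per-code-point weights over the NFC-normalized text."""
    normalized = unicodedata.normalize("NFC", text)
    return sum(weight_of(ord(ch)) for ch in normalized) // SCALE


khmer_ka = "\u1780"  # ក
print(weighted_length("a" * 280))       # 280
print(weighted_length(khmer_ka * 140))  # 280, the counter is already full
print(weighted_length(khmer_ka * 141))  # 282, over the limit
```

Under these assumed ranges, Khmer falls through to the default weight simply because U+1780 is greater than U+10FF.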

The config implements the following metric:

There are a couple of issues here:

Firstly, even if Korean is "dense", text from the Hangul Jamo block is not -- representing Hangul text with this block takes three times as many codepoints as the more commonly used Hangul Syllables block (most but IIRC not all IMEs for Korean construct precomposed Hangul Syllables). I guess this isn't overall a big deal since as far as I can tell this library NFCs the input anyway, at which point Jamo are turned into Syllables, but either way it's an incorrect choice of start point.
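The Jamo-to-Syllables point can be checked with a short Python sketch using the standard `unicodedata` module. The specific syllable chosen here is just an illustration.

```python
import unicodedata

# Three conjoining Jamo code points: U+1112, U+1161, U+11AB
jamo = "\u1112\u1161\u11ab"

# NFC composes them into one precomposed code point from the Hangul Syllables block
composed = unicodedata.normalize("NFC", jamo)

print(len(jamo), len(composed))  # 3 1
print(hex(ord(composed)))        # 0xd55c
```

So after NFC the text is counted as one code point from the Hangul Syllables block, not three from Hangul Jamo.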

The more pressing issue here is that there are a ton of scripts after the Hangul Jamo block that aren't even CJK but get roped into this anyway. Khmer isn't very dense (I'd guess it would be as dense as most Indic scripts) but gets rolled in anyway. It has 16 million speakers. Ethiopic (slightly denser, but not too dense) covers a bunch of languages that add up to 30 million. There are a bunch more scripts in modern use that got unfortunately rolled up into this bundle.

Even within CJK, not all CJK is equal. Hiragana and Katakana text is far less dense than Kanji text, and typical Japanese tweets (that crop up on my timeline) contain a mix with widely varying proportions. It would be worth looking into the numbers with this in mind and perhaps assigning a different weight to Hiragana/Katakana. There's a similar dynamic with Bopomofo and Hanzi for Chinese, though I doubt folks tweet in Bopomofo as it's primarily a phonetic aid.

Furthermore, even within the Unicode blocks already weighted as 100, you also have potential issues. Many Indic scripts require on average two code points per "letter". These "letters" encode more information than Latin letters so it may make sense to weight them a bit more, but the current weighting may be too much. It's worth taking a closer look.

Similarly, some languages that use the Perso-Arabic script (e.g. Urdu) commonly use tashkil (the largely-optional vowel diacritics). Admittedly, I've not seen this happen as often on Twitter (perhaps due to character limit issues?) but I also don't have much Urdu crop up in my timeline, and am not a native speaker. I've definitely seen tashkil being used consistently online and offline for Urdu (but not, say, Arabic, except for religious texts). It's also worth taking a closer look here, perhaps ignoring tashkil code points if they follow non-diacritic Arabic letters. Hebrew has similar optional diacritics but I'm not sure how widespread the usage is.

Code points aren't "characters"; they're a convenient abstraction that makes sense in the context of Unicode itself. I've written about this before.

Overall, the concept of "character length" is one that makes perfect sense from an ASCII/Latin standpoint, but gets fuzzier when you start thinking about other scripts. This project had a taste of this when improving its emoji handling, but the problem exhibited by the emoji handling is really a more general one that is also exhibited by many scripts, even if you fix the more glaring issue of the Unicode ranges counting a lot of non-CJK as double. Counting grapheme clusters rather than code points is one fix, though you still have to investigate what counts as "dense" in that context (and protect against arbitrary-length grapheme clusters).
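A rough Python sketch of that idea, with loud caveats: this is not UAX #29 grapheme segmentation. It only folds combining marks (general category M*) into the preceding code point, and the cap `MAX_MARKS_PER_CLUSTER` and the function name `approx_cluster_count` are hypothetical. It does not handle conjuncts, ZWJ sequences, or anything else a real segmenter would.

```python
import unicodedata

# ASSUMPTION: arbitrary cap so a pile of stacked marks cannot ride along for free.
MAX_MARKS_PER_CLUSTER = 4


def approx_cluster_count(text):
    """Count code points, letting up to MAX_MARKS_PER_CLUSTER combining
    marks attach to the preceding counted code point at no extra cost."""
    text = unicodedata.normalize("NFC", text)
    count = 0
    marks_in_cluster = 0
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and count > 0 and marks_in_cluster < MAX_MARKS_PER_CLUSTER:
            marks_in_cluster += 1  # absorbed into the current cluster
        else:
            count += 1             # new cluster, or a mark past the cap
            marks_in_cluster = 0
    return count


print(approx_cluster_count("\u0628\u064e"))       # Arabic beh + fatha -> 1
print(approx_cluster_count("\u0915\u093f"))       # Devanagari ka + vowel sign i -> 1
print(approx_cluster_count("\u0628" + "\u064e" * 10))  # cap kicks in -> 3
```

The cap is the "protect against arbitrary-length grapheme clusters" part: every mark past the limit is charged as if it started a new cluster.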


Ultimately tweet character length is a minor issue, but clearly this project cares about it -- I feel that if character counts for skin tone emoji are considered a problem big enough to fix, character counts for entire languages -- probably a bigger problem -- should be fixed too.

eKevinHoang commented 5 years ago

I ran into a similar problem with Japanese. The same text is accepted on twitter.com, but this library reports it as invalid because the weightedLength is 282.

I hope that this issue will be fixed soon.

Manishearth commented 5 years ago

To be clear, this issue is not about mismatches between twitter.com and the twitter-text library, it's about the behavior of both of them being incorrect for many scripts.


kaushlakers commented 5 years ago

@Manishearth thank you for the detailed description. Really appreciate you taking the time to write this up!

I think your concerns about languages other than CJK getting roped into this higher weighted treatment are definitely valid and worth discussing with the team.

Small clarification here:

Emoji count as 1 character

All emojis actually count as 2 characters because the related codepoints fall into the range that gets weighted as 2.

Regarding this:

Counting grapheme clusters over code points is one fix

We definitely considered taking this route, but it was quite complex to implement consistently and efficiently across the different platforms (objC, Java, JavaScript and Ruby) that are used in Twitter apps and backend services. The limited scope of Emoji (for now) made for a manageable fix with high gains.

Manishearth commented 5 years ago

a manageable fix with high gains.

Yeah you don't have to go the grapheme route, but you can at least put some thought into the ranges so that scripts with no interesting property other than "appears after Korean in Unicode" don't get hit. The library supports this already; you just have to identify the ranges and stick them in the JSON file.
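As a sketch of what that could look like, here is a Python snippet that reads a JSON config with two extra ranges, one for the Ethiopic block (U+1200 to U+137F) and one for the Khmer block (U+1780 to U+17FF). The field names (`scale`, `defaultWeight`, `ranges`, `start`, `end`, `weight`) are assumed to match the library's config format, the first range is recalled rather than copied from the repo, and `weighted_length` is a hypothetical helper, not the library's implementation.

```python
import json
import unicodedata

# ASSUMPTION: field names and the first range mirror the library's config;
# the second and third ranges are the proposed additions.
CONFIG_JSON = """
{
  "scale": 100,
  "defaultWeight": 200,
  "ranges": [
    {"start": 0,    "end": 4351, "weight": 100},
    {"start": 4608, "end": 4991, "weight": 100},
    {"start": 6016, "end": 6143, "weight": 100}
  ]
}
"""


def weighted_length(text, config):
    """Weighted length of NFC-normalized text under the given config."""
    total = 0
    for ch in unicodedata.normalize("NFC", text):
        cp = ord(ch)
        weight = config["defaultWeight"]
        for r in config["ranges"]:
            if r["start"] <= cp <= r["end"]:
                weight = r["weight"]
                break
        total += weight
    return total // config["scale"]


config = json.loads(CONFIG_JSON)
print(weighted_length("\u1780" * 280, config))  # Khmer: 280
print(weighted_length("\u1200", config))        # Ethiopic: 1
print(weighted_length("\ud55c", config))        # Hangul syllable still: 2
```

Which blocks actually deserve weight 100 is the part that needs real thought; the two above are only the examples already named in this thread.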