Figure out what we're doing with ICU tokenisation and locales

reivilibre commented 1 year ago

The ICU tokenisation rules seem to vary on different platforms.

Is it the ICU version? The locale? (How does ICU even get a default locale? I had a quick spelunk in the source code and couldn't find it!)

We need to figure out:

what we actually want from the ICU library
how we get that
how we get consistent results.

It feels like we want a 'universal' locale independent of the host's settings, so that Synapse works well with all languages. (This may be a pie in the sky goal!) What's the best we can do?

This issue was originally dug up in #15079, but e.g. Patrick's machine generates another tokenisation yet again. I'm not satisfied with the current solution..

clokep commented 1 year ago

For reference I'm getting the following error on UserDirectoryICUTestCase.test_icu_word_boundary_punctuation:

Traceback (most recent call last):
  File "synapse/tests/storage/test_user_directory.py", line 559, in test_icu_word_boundary_punctuation
    self.assertIn(
  File ".env/twisted/trial/_synctest.py", line 506, in assertIn
    raise self.failureException(msg or f"{containee!r} not in {container!r}")
twisted.trial.unittest.FailTest: ["lazy'fox", 'jumped', 'over', 'the.dog'] not in (["lazy'fox", 'jumped', 'over', 'the', 'dog'], ["lazy'fox", 'jumped:over', 'the.dog'])

deepbluev7 commented 1 year ago

I think the change in colon behaviour was actually added by Apple for ICU 72.1: https://unicode-org.atlassian.net/browse/ICU-22112

So possibly that will break in future Ubuntu versions too :)

reivilibre commented 1 year ago

I'm somewhat confused here; I tried segmenting this snippet on 3 versions of Ubuntu (3 versions of ICU) and they all gave the same result. From this, we might hope to conclude that ICU version is not what affects it.

But I tried playing around with locales and could not invite a change that way either.

print(_parse_words_with_icu("lazy'dog jumped:over the.fox 授業は八時三十分から始まるから。"))

-----
Ubuntu 20.04
ICU  66.1
["lazy'dog", 'jumped', 'over', 'the', 'fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']
-----
Ubuntu 22.04
ICU  70.1
["lazy'dog", 'jumped', 'over', 'the', 'fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']
-----
Ubuntu 23.04
ICU  72.1
["lazy'dog", 'jumped', 'over', 'the', 'fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']

The above were containerised (docker) installs.

My laptop has Kubuntu 22.10 and says

ICU 71.1
["lazy'dog", 'jumped:over', 'the.fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']

>>> [f"{k}={v}" for k, v in os.environ.items() if k.startswith("LC_")]
['LC_MONETARY=en_GB.UTF-8', 'LC_MEASUREMENT=fr_FR.UTF-8', 'LC_CTYPE=en_GB.UTF-8', 'LC_TIME=en_GB.UTF-8', 'LC_COLLATE=en_GB.UTF-8', 'LC_NUMERIC=en_GB.UTF-8']

In an Ubuntu docker container, I get

>>> [f"{k}={v}" for k, v in os.environ.items() if k.startswith("LC_")]
['LC_CTYPE=C.UTF-8']

Even if I change the LC_ env vars on my laptop to just that, I get the same tokenisation. (I also tried using different methods on Locale to see if I could get a different Locale from within PyICU. Unless it's just this one version of ICU that has the difference? I guess I should try that one in docker as well) EDIT: ICU 71.1 in a container acted the same as every other ICU version in a container...

deepbluev7 commented 1 year ago

Huh, interesting, because I did see that change on their github, but possibly that is a different library? I probably mixed something up then, sorry!

matrix-org / synapse

Figure out what we're doing with ICU tokenisation and locales #15124