Open reivilibre opened 1 year ago
For reference I'm getting the following error on UserDirectoryICUTestCase.test_icu_word_boundary_punctuation
:
Traceback (most recent call last):
File "synapse/tests/storage/test_user_directory.py", line 559, in test_icu_word_boundary_punctuation
self.assertIn(
File ".env/twisted/trial/_synctest.py", line 506, in assertIn
raise self.failureException(msg or f"{containee!r} not in {container!r}")
twisted.trial.unittest.FailTest: ["lazy'fox", 'jumped', 'over', 'the.dog'] not in (["lazy'fox", 'jumped', 'over', 'the', 'dog'], ["lazy'fox", 'jumped:over', 'the.dog'])
I think the change in colon behaviour was actually added by Apple for ICU 72.1: https://unicode-org.atlassian.net/browse/ICU-22112
So possibly that will break in future Ubuntu versions too :)
I'm somewhat confused here; I tried segmenting this snippet on 3 versions of Ubuntu (3 versions of ICU) and they all gave the same result. From this, we might hope to conclude that ICU version is not what affects it.
But I tried playing around with locales and could not invite a change that way either.
print(_parse_words_with_icu("lazy'dog jumped:over the.fox 授業は八時三十分から始まるから。"))
-----
Ubuntu 20.04
ICU 66.1
["lazy'dog", 'jumped', 'over', 'the', 'fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']
-----
Ubuntu 22.04
ICU 70.1
["lazy'dog", 'jumped', 'over', 'the', 'fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']
-----
Ubuntu 23.04
ICU 72.1
["lazy'dog", 'jumped', 'over', 'the', 'fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']
The above were containerised (docker) installs.
My laptop has Kubuntu 22.10 and says
ICU 71.1
["lazy'dog", 'jumped:over', 'the.fox', '授業', 'は', '八時', '三', '十分', 'から', '始まる', 'から']
>>> [f"{k}={v}" for k, v in os.environ.items() if k.startswith("LC_")]
['LC_MONETARY=en_GB.UTF-8', 'LC_MEASUREMENT=fr_FR.UTF-8', 'LC_CTYPE=en_GB.UTF-8', 'LC_TIME=en_GB.UTF-8', 'LC_COLLATE=en_GB.UTF-8', 'LC_NUMERIC=en_GB.UTF-8']
In an Ubuntu docker
container, I get
>>> [f"{k}={v}" for k, v in os.environ.items() if k.startswith("LC_")]
['LC_CTYPE=C.UTF-8']
Even if I change the LC_
env vars on my laptop to just that, I get the same tokenisation. (I also tried using different methods on Locale
to see if I could get a different Locale from within PyICU. Unless it's just this one version of ICU that has the difference? I guess I should try that one in docker as well)
EDIT: ICU 71.1 in a container acted the same as every other ICU version in a container...
Huh, interesting, because I did see that change on their github, but possibly that is a different library? I probably mixed something up then, sorry!
The ICU tokenisation rules seem to vary on different platforms.
Is it the ICU version? The locale? (How does ICU even get a default locale? I had a quick spelunk in the source code and couldn't find it!)
We need to figure out:
It feels like we want a 'universal' locale independent of the host's settings, so that Synapse works well with all languages. (This may be a pie in the sky goal!) What's the best we can do?
This issue was originally dug up in #15079, but e.g. Patrick's machine generates another tokenisation yet again. I'm not satisfied with the current solution..