google / budoux

https://google.github.io/budoux/
Apache License 2.0

Clarification on the unicode used on the keys #504

Open Cloudef opened 9 months ago

Cloudef commented 9 months ago

Hello, I'm porting BudouX to Zig (and C). I wonder what exactly the Unicode format of the keys in the model is. Are they normalized? When constructing a key, should it be built from a range of codepoints or from graphemes? I'm asking because currently all the tests pass except for the Thai model, and this might be the issue I'm hitting.

Cloudef commented 9 months ago

Looking at the Java code, BudouX simply takes substrings. Java internally stores strings as UTF-16, so I guess these are neither codepoints nor graphemes but rather UTF-16 slices. This also explains why there are some garbage keys in the Thai model. I think this should be fixed, as it may also impact accuracy.

Cloudef commented 9 months ago

https://github.com/Cloudef/zig-budoux I decided not to support the Thai model for the time being.

Cloudef commented 9 months ago

How I would fix this:

Grapheme-aware keys would give the most accurate results, since a grapheme is what we humans think of as a character. However, grapheme handling may increase the code size of the runtime. Codepoints are a good tradeoff: they don't have the UTF-16 soup problem, and the code required to iterate codepoints is light. Normalization is not required, but it standardizes the keys and reduces the model size. If you want BudouX to work with any Unicode input, you could instead normalize the training data to all forms, which would increase the model size.

The runtime would then also have to normalize the inputs (or the keys) to NFKC for the best result.
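For illustration, here is a minimal Python sketch (generic Unicode handling, not BudouX code) of why normalization matters for the keys: two canonically equivalent inputs would otherwise produce two distinct keys.

import unicodedata

composed = "\u00e9"      # 'é' as a single precomposed codepoint
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)  # False: two different key strings
print(unicodedata.normalize("NFKC", composed)
      == unicodedata.normalize("NFKC", decomposed))  # True: one key after NFKC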

Casefolding is usually done before normalization when building indices, but I don't think it's required for BudouX, as I can't think of any language that has upper/lower case but no word boundaries.

Related: http://utf8everywhere.org/ https://mortoray.com/the-string-type-is-broken/

tushuhei commented 9 months ago

Hi @Cloudef,

Thanks for your insightful feedback! The keys are meant to be a list of codepoints, rather than grapheme clusters.

Normalization

It's a good point. It would definitely be better to normalize the training data before feeding it into the training pipeline. I believe it should be fairly easy to do by tweaking our encode_data.py, where we extract keys from the input data. The key extraction is done in Python 3 with str indexing and concatenation, so I believe it's working as intended (i.e. codepoint-aware).
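For reference, a minimal sketch of what such a tweak could look like, assuming the training sentences are read line by line; the function name and placement are hypothetical, not the actual encode_data.py code.

import unicodedata

def load_training_sentences(path):
    # Hypothetical pre-processing step: normalize each training sentence to NFKC
    # before any keys are extracted, so composed and decomposed variants of the
    # same character map to the same codepoint-based key.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield unicodedata.normalize("NFKC", line.rstrip("\n"))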

Thai model

I'm not an expert in Thai, but given that some diacritical marks such as U+0E37 are included in the learned model, it seems some diacritical marks play an important role in splitting sentences. That's a good thing, because it means the model learned a more generic rule that can be applied to any character carrying that diacritical mark, which eventually leads to a smaller model file. I'm not an expert in C/Zig either, but could you elaborate on the problem you're facing in adopting codepoint-aware keys instead of grapheme-aware keys?

UTF-16

For the JavaScript and Java ports, I agree that we need to adopt approaches that make sure characters are treated on a codepoint basis instead of a UTF-16 code-unit basis. We should add some test cases with characters that span multiple 16-bit code units.
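As an illustration of the property such a test case would exercise (a hypothetical example character, not an existing test), here is a quick Python check:

ch = "\U00020B9F"  # '𠮟' (U+20B9F), a kanji outside the Basic Multilingual Plane
print(len(ch))                           # 1: a single codepoint in Python
print(len(ch.encode("utf-16-le")) // 2)  # 2: a surrogate pair in UTF-16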

BTW, your zig-budoux work looks awesome! Congrats on the launch :)

@allensu05 Does the ICU port also presume that the model keys are codepoint-based? Do you have any insights on this issue? @kojiishi for vis

Cloudef commented 9 months ago

Hello, thank you for your response.

Regarding graphemes vs codepoints, as long as the intent is clear there is no problem. Normalization is definitely something that should be considered.

There might also be a bug in my implementation. I took a hexdump of the UTF-16-encoded string วันนี้อากาศดี and all the values seem to fall within the Unicode BMP range. However, the logic will fail if there are any high or low surrogate pairs (at least for Java and JavaScript; I'm not sure about Python's status quo). If the keys can never contain codepoints outside the BMP, then there is no problem, of course.
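A quick Python check (illustrative only) confirming that the test string stays within the BMP:

s = "วันนี้อากาศดี"
print(all(ord(c) <= 0xFFFF for c in s))  # True: every codepoint is in the BMP,
                                         # so its UTF-16 form has no surrogate pairs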

The test in question is this:

var iter = model.iterator("วันนี้อากาศดี");
try std.testing.expectEqualSlices(u8, "วัน", iter.next().?);
try std.testing.expectEqualSlices(u8, "นี้", iter.next().?);
try std.testing.expectEqualSlices(u8, "อากาศ", iter.next().?);
try std.testing.expectEqualSlices(u8, "ดี", iter.next().?);

The last condition fails, though the segmenter correctly seems to check for , , ดี, ศดี, etc. (that is, every hash-map key is a valid UTF-8 string).

I ran this on both the UTF-8 and UTF-16 output and I get the same codepoints. Interestingly, though, this program's output also seems slightly broken, so maybe some care needs to be taken even with Python strings.

~/d/p/zig-budoux master• 5.7s ❱ nix run nixpkgs#python3 -- uni16.py test-utf16.txt
文字単位: ['ว', 'ั', 'น', 'น', 'ี', '้', 'อ', 'า', 'ก', 'า', 'ศ', 'ด', 'ี'], 長さ: 13
ว : THAI CHARACTER WO WAEN
 : THAI CHARACTER MAI HAN-AKAT
น : THAI CHARACTER NO NU
น : THAI CHARACTER NO NU
 : THAI CHARACTER SARA II
  : THAI CHARACTER MAI THO
อ : THAI CHARACTER O ANG
า : THAI CHARACTER SARA AA
ก : THAI CHARACTER KO KAI
า : THAI CHARACTER SARA AA
ศ : THAI CHARACTER SO SALA
ด : THAI CHARACTER DO DEK
 : THAI CHARACTER SARA II
~/d/p/zig-budoux master• ❱ nix run nixpkgs#python3 -- uni.py test.txt
文字単位: ['ว', 'ั', 'น', 'น', 'ี', '้', 'อ', 'า', 'ก', 'า', 'ศ', 'ด', 'ี'], 長さ: 13
ว : THAI CHARACTER WO WAEN
 : THAI CHARACTER MAI HAN-AKAT
น : THAI CHARACTER NO NU
น : THAI CHARACTER NO NU
 : THAI CHARACTER SARA II
  : THAI CHARACTER MAI THO
อ : THAI CHARACTER O ANG
า : THAI CHARACTER SARA AA
ก : THAI CHARACTER KO KAI
า : THAI CHARACTER SARA AA
ศ : THAI CHARACTER SO SALA
ด : THAI CHARACTER DO DEK
 : THAI CHARACTER SARA II

The Chinese and Japanese models do not use combining characters in their keys at all, AFAIK.

EDIT: I think I might have found the issue in my implementation.

Cloudef commented 9 months ago

Unfortunately, while I did find an issue in my implementation regarding the history bookkeeping, it was unrelated to this. Here is how the maps are accessed, starting from n-2:

UW1: า => 209
UW2: ก => -260
UW3: า => 127
UW4: ศ => 0
UW5: ด => 344
UW6: ี => -52
BW1: กา => -468
BW2: าศ => 0
BW3: ศด => 0
TW1: ากา => 0
TW2: กาศ => 0
TW3: าศด => 0
TW4: ศดี => 0
UW1: ก => 404
UW2: า => 996
UW3: ศ => 0
UW4: ด => -992
UW5: ี => 1819
BW1: าศ => 246
BW2: ศด => 0
BW3: ดี => 853
TW1: กาศ => 0
TW2: าศด => 0
TW3: ศดี => 0
TW4: ดี => 0
UW1: า => 209
UW2: ศ => 438
UW3: ด => 964
UW4: ี => -6019
BW1: ศด => 0
BW2: ดี => 0
BW3: ี => 0
TW1: าศด => 0
TW2: ศดี => 0
TW3: ดี => 0
TW4: ี => 0
UW1: ศ => 0
UW2: ด => 590
UW3: ี => 310
BW1: ดี => 255
BW2: ี => 0
TW1: ศดี => 0
TW2: ดี => 0
TW3: ี => 0

Cloudef commented 9 months ago

Ah, never mind. I just realized that the Python / JavaScript / Java implementations always return the last chunk regardless of the score, so that's my dumb mistake :sweat_smile:
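For anyone reading along, a rough Python sketch of that behavior, i.e. a boundary is inserted only where the score is positive and whatever remains at the end is always emitted as the final chunk (illustrative only; boundary_score is a hypothetical stand-in for the model's scoring, not the actual implementation):

def segment(sentence, boundary_score):
    # Start a new chunk only where the score is positive; the trailing chunk is
    # always flushed at the end, regardless of its score.
    if not sentence:
        return []
    chunks = [sentence[0]]
    for i in range(1, len(sentence)):
        if boundary_score(sentence, i) > 0:
            chunks.append(sentence[i])
        else:
            chunks[-1] += sentence[i]
    return chunks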

I think the normalization may still be important, so I'll leave this issue open for reference; you can close it if you want. Thanks for BudouX, btw.