mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Serbian is digraphic with both Latin and Cyrillic, which will cause some issues for training #681

Open gregtatum opened 1 week ago

gregtatum commented 1 week ago

Here is a basic character distribution of the newscrawl data (2019), which contains both Latin and Cyrillic. Serbian is digraphic, which means it is fine to use either script. The text I looked at was generally written in one script or the other, not a mix of the two.

U+0061 a 36823506   Ll Basic Latin [a-z]
U+0069 i 28768820   Ll Basic Latin [a-z]
U+006F o 28047788   Ll Basic Latin [a-z]
U+0065 e 27726598   Ll Basic Latin [a-z]
U+006E n 19630607   Ll Basic Latin [a-z]
U+0072 r 15944828   Ll Basic Latin [a-z]
U+0073 s 14400506   Ll Basic Latin [a-z]
U+006A j 13461323   Ll Basic Latin [a-z]
U+0074 t 13311291   Ll Basic Latin [a-z]
U+0075 u 13238157   Ll Basic Latin [a-z]
U+0064 d 12106054   Ll Basic Latin [a-z]
U+006B k 10952163   Ll Basic Latin [a-z]
U+0076 v 10500989   Ll Basic Latin [a-z]
U+006C l 10164368   Ll Basic Latin [a-z]
U+006D m 9739281    Ll Basic Latin [a-z]
U+0070 p 9012524    Ll Basic Latin [a-z]
U+0430 а 6502503    Ll Cyrillic [а-я]
U+007A z 5265714    Ll Basic Latin [a-z]
U+0438 и 5258341    Ll Cyrillic [а-я]
U+043E о 5033584    Ll Cyrillic [а-я]
U+0067 g 4962589    Ll Basic Latin [a-z]
U+0062 b 4927764    Ll Basic Latin [a-z]
U+0435 е 4860633    Ll Cyrillic [а-я]
U+043D н 3187841    Ll Cyrillic [а-я]
U+0063 c 2974503    Ll Basic Latin [a-z]
U+0440 р 2938144    Ll Cyrillic [а-я]
U+0161 š 2923692    Ll Latin Extended-A
U+010D č 2688552    Ll Latin Extended-A
U+0441 с 2673526    Ll Cyrillic [а-я]
U+0442 т 2387137    Ll Cyrillic [а-я]
U+0443 у 2318061    Ll Cyrillic [а-я]
U+0107 ć 2170082    Ll Latin Extended-A
U+0434 д 2133005    Ll Cyrillic [а-я]
U+043A к 1997802    Ll Cyrillic [а-я]
U+0432 в 1940220    Ll Cyrillic [а-я]
U+0068 h 1879917    Ll Basic Latin [a-z]
U+0458 ј 1846705    Ll Cyrillic
U+043C м 1728057    Ll Cyrillic [а-я]
U+017E ž 1692347    Ll Latin Extended-A
U+043F п 1615131    Ll Cyrillic [а-я]
U+043B л 1502708    Ll Cyrillic [а-я]
U+0066 f 994269     Ll Basic Latin [a-z]
U+0437 з 931739     Ll Cyrillic [а-я]
U+0431 б 878234     Ll Cyrillic [а-я]
U+0433 г 866529     Ll Cyrillic [а-я]
U+0111 đ 689848     Ll Latin Extended-A
U+0448 ш 511149     Ll Cyrillic [а-я]
U+0447 ч 496788     Ll Cyrillic [а-я]
U+0446 ц 485749     Ll Cyrillic [а-я]
U+045B ћ 385264     Ll Cyrillic
U+045A њ 375473     Ll Cyrillic
U+0445 х 315929     Ll Cyrillic [а-я]
U+0436 ж 274141     Ll Cyrillic [а-я]
U+0459 љ 240534     Ll Cyrillic
U+0444 ф 165158     Ll Cyrillic [а-я]
U+0452 ђ 125511     Ll Cyrillic
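
For anyone who wants to reproduce a listing like the one above on another dataset, here is a minimal sketch. It is not the tool used for this issue. The function names are hypothetical, and the script label is guessed from the Unicode character name because the Python standard library has no script or block lookup, so the last column will say `Latin` or `Cyrillic` rather than a block name such as `Basic Latin`.

```python
import unicodedata
from collections import Counter


def char_distribution(lines):
    """Count letters only, skipping whitespace, digits, and punctuation."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if ch.isalpha())
    return counts


def script_of(ch):
    """Rough script guess from the Unicode character name.

    Assumption: the name prefix is a good enough proxy for the script.
    """
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "Latin"
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    return "Other"


def format_rows(counts):
    """One row per character, most frequent first."""
    return [
        f"U+{ord(ch):04X} {ch} {n} {unicodedata.category(ch)} {script_of(ch)}"
        for ch, n in counts.most_common()
    ]


if __name__ == "__main__":
    sample = ["Dobar dan", "Добар дан"]
    for row in format_rows(char_distribution(sample)):
        print(row)
```

Summing the counts per script gives a rough estimate of how much of a corpus is in each script.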

I think this is fine for the sr-en direction, as the vocabulary will just be bigger. A quick test of Google Translate shows that it happily accepts mixed scripts. I assume we'll take a small hit in quality, or we may need to increase the vocab size.

The problem with our pipeline is the en-sr direction. We'll need to choose either Cyrillic or Latin as the target script. It looks like Google Translate uses Cyrillic when it outputs a translation. We'll need to either pre-filter all of our data down to the chosen script, or add support to the pipeline for filtering by script.
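
As a rough illustration of what filtering by script could look like, here is a minimal sketch. It is not existing pipeline code: `line_script`, `filter_pairs`, and the `threshold` value are all hypothetical, and the script is again guessed from the Unicode character name.

```python
import unicodedata


def line_script(text, threshold=0.9):
    """Return "Latin", "Cyrillic", "mixed", or "none" for one line.

    threshold is a hypothetical cutoff: the share of letters that must be
    in one script before the line counts as that script.
    """
    latin = cyrillic = 0
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            latin += 1
        elif name.startswith("CYRILLIC"):
            cyrillic += 1
    total = latin + cyrillic
    if total == 0:
        return "none"
    if latin / total >= threshold:
        return "Latin"
    if cyrillic / total >= threshold:
        return "Cyrillic"
    return "mixed"


def filter_pairs(pairs, target_script):
    """Keep (en, sr) pairs whose Serbian side is in target_script."""
    return [(en, sr) for en, sr in pairs if line_script(sr) == target_script]


if __name__ == "__main__":
    pairs = [
        ("Good day", "Dobar dan"),
        ("Good day", "Добар дан"),
        ("Good day", "Dobar дан"),
    ]
    print(filter_pairs(pairs, "Cyrillic"))
```

The threshold is set below 1.0 so that a Cyrillic sentence containing a Latin brand name or URL is not discarded. The right value would need to be tuned against real data.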

A quick check of NLLB shows that it is mixed script as well.