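Here is a basic character distribution of the 2019 newscrawl data, which mixes Latin and Cyrillic. Serbian is digraphic, which means either script is fine to use. When I looked through the data, text was generally in one script or the other rather than mixing the two. The columns below are: code point, character, count, Unicode general category, Unicode block, and the character range it falls in (if any).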
U+0061 a 36823506 Ll Basic Latin [a-z]
U+0069 i 28768820 Ll Basic Latin [a-z]
U+006F o 28047788 Ll Basic Latin [a-z]
U+0065 e 27726598 Ll Basic Latin [a-z]
U+006E n 19630607 Ll Basic Latin [a-z]
U+0072 r 15944828 Ll Basic Latin [a-z]
U+0073 s 14400506 Ll Basic Latin [a-z]
U+006A j 13461323 Ll Basic Latin [a-z]
U+0074 t 13311291 Ll Basic Latin [a-z]
U+0075 u 13238157 Ll Basic Latin [a-z]
U+0064 d 12106054 Ll Basic Latin [a-z]
U+006B k 10952163 Ll Basic Latin [a-z]
U+0076 v 10500989 Ll Basic Latin [a-z]
U+006C l 10164368 Ll Basic Latin [a-z]
U+006D m 9739281 Ll Basic Latin [a-z]
U+0070 p 9012524 Ll Basic Latin [a-z]
U+0430 а 6502503 Ll Cyrillic [а-я]
U+007A z 5265714 Ll Basic Latin [a-z]
U+0438 и 5258341 Ll Cyrillic [а-я]
U+043E о 5033584 Ll Cyrillic [а-я]
U+0067 g 4962589 Ll Basic Latin [a-z]
U+0062 b 4927764 Ll Basic Latin [a-z]
U+0435 е 4860633 Ll Cyrillic [а-я]
U+043D н 3187841 Ll Cyrillic [а-я]
U+0063 c 2974503 Ll Basic Latin [a-z]
U+0440 р 2938144 Ll Cyrillic [а-я]
U+0161 š 2923692 Ll Latin Extended-A
U+010D č 2688552 Ll Latin Extended-A
U+0441 с 2673526 Ll Cyrillic [а-я]
U+0442 т 2387137 Ll Cyrillic [а-я]
U+0443 у 2318061 Ll Cyrillic [а-я]
U+0107 ć 2170082 Ll Latin Extended-A
U+0434 д 2133005 Ll Cyrillic [а-я]
U+043A к 1997802 Ll Cyrillic [а-я]
U+0432 в 1940220 Ll Cyrillic [а-я]
U+0068 h 1879917 Ll Basic Latin [a-z]
U+0458 ј 1846705 Ll Cyrillic
U+043C м 1728057 Ll Cyrillic [а-я]
U+017E ž 1692347 Ll Latin Extended-A
U+043F п 1615131 Ll Cyrillic [а-я]
U+043B л 1502708 Ll Cyrillic [а-я]
U+0066 f 994269 Ll Basic Latin [a-z]
U+0437 з 931739 Ll Cyrillic [а-я]
U+0431 б 878234 Ll Cyrillic [а-я]
U+0433 г 866529 Ll Cyrillic [а-я]
U+0111 đ 689848 Ll Latin Extended-A
U+0448 ш 511149 Ll Cyrillic [а-я]
U+0447 ч 496788 Ll Cyrillic [а-я]
U+0446 ц 485749 Ll Cyrillic [а-я]
U+045B ћ 385264 Ll Cyrillic
U+045A њ 375473 Ll Cyrillic
U+0445 х 315929 Ll Cyrillic [а-я]
U+0436 ж 274141 Ll Cyrillic [а-я]
U+0459 љ 240534 Ll Cyrillic
U+0444 ф 165158 Ll Cyrillic [а-я]
U+0452 ђ 125511 Ll Cyrillic
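For reference, a table like the one above could be produced with something along these lines. This is a minimal sketch, not the script that was actually used: the function names `char_distribution` and `format_row` are hypothetical, and it only counts lowercase letters (category `Ll`), which is an assumption based on the rows shown. Python's standard library does not expose Unicode block names, so the block and range columns are omitted here.

```python
import unicodedata
from collections import Counter


def char_distribution(lines):
    """Count lowercase letters (general category Ll) across an iterable of lines.

    Hypothetical helper: restricting to Ll is an assumption made to match
    the table, which lists only lowercase letters.
    """
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if unicodedata.category(ch) == "Ll")
    return counts


def format_row(ch, count):
    """Format one row as: code point, character, count, general category."""
    return f"U+{ord(ch):04X} {ch} {count} {unicodedata.category(ch)}"


if __name__ == "__main__":
    # Tiny made-up sample; a real run would stream the corpus file instead.
    sample = ["Добар дан", "Dobar dan"]
    for ch, count in char_distribution(sample).most_common():
        print(format_row(ch, count))
```

Sorting with `most_common()` gives the same descending-by-count order as the table.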
I think this is fine for the sr-en direction, as the vocabulary will just be larger. A quick test of Google Translate shows that it happily accepts mixed-script input. I assume we'll take a small hit in quality, or may need to increase the vocab size.
The problem for our pipeline is en-sr. We'll need to choose either Cyrillic or Latin as the target script. Google Translate appears to use Cyrillic when it outputs a translation. We'll either need to pre-filter all of our data down to the chosen script, or add support to the pipeline for filtering by script.
A quick check of NLLB shows that it is mixed script as well.
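The filter-by-script option mentioned above could look roughly like the following. This is a sketch under stated assumptions, not existing pipeline code: `script_counts`, `classify_script`, and `filter_pairs` are hypothetical names, and the 0.9 threshold is an arbitrary placeholder chosen so that a Cyrillic sentence containing a Latin-script brand name or URL is not rejected outright. Script detection relies on the Unicode character name prefix from the standard library.

```python
import unicodedata


def script_counts(text):
    """Return (latin, cyrillic) letter counts, based on Unicode character names."""
    latin = cyrillic = 0
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            cyrillic += 1
        elif name.startswith("LATIN"):
            latin += 1
    return latin, cyrillic


def classify_script(text, threshold=0.9):
    """Label text as 'latin', 'cyrillic', 'mixed', or 'none'.

    threshold is a hypothetical tuning knob: the fraction of letters that
    must belong to one script for the text to count as that script.
    """
    latin, cyrillic = script_counts(text)
    total = latin + cyrillic
    if total == 0:
        return "none"
    if cyrillic / total >= threshold:
        return "cyrillic"
    if latin / total >= threshold:
        return "latin"
    return "mixed"


def filter_pairs(pairs, target_script="cyrillic"):
    """Keep only (en, sr) pairs whose Serbian side is in the target script."""
    return [(en, sr) for en, sr in pairs if classify_script(sr) == target_script]
```

For en-sr this would be applied to the Serbian side of the training data with whichever script is chosen as the target. For sr-en no filtering would be needed if mixed-script input is acceptable.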