eu9ene opened this issue 3 months ago
Doing this for translating into English seems fine, but translating into Chinese gives me more doubts, as Traditional Chinese corpora might be in Cantonese or other variants that are not Mandarin. Mixing individual variants of a macro-language in the decoder might increase the number of translation instances that contradict each other. Might this be a use case for opus-trainer? For example, when doing en->zh-hant, use all hans and hant data during a first phase, then a second phase including only hant?
Yeah, I'm quite confused about all these variants and their availability on OPUS.
- Chinese (zh): 200M sentences
- Chinese Taiwan (Mandarin Traditional, I assume): 20M
- Chinese Hong Kong (assuming Traditional, but is it even Mandarin?): 11K
- Cantonese (which we're not planning to train yet): 9K
So far I have implemented converting everything to Simplified, as that's our main goal for now, with a big user base in Mainland China (actually, we need to double-check our stats, as I haven't seen what the split is between Mainland and Taiwan).
I wonder if just filtering out Cantonese and converting Mandarin between Traditional and Simplified scripts will work for training separate models. I see fastText supports the `yue` language code (Cantonese) and `zh`. I would assume it filters out Cantonese then, but I'm not sure; I need to double-check that.
> Might this be a use case for opus-trainer? For example, when doing en->zh-hant, use all hans and hant data during a first phase, then a second phase including only hant?
It's totally possible if we have enough data for each stage. I expect it would be quite expensive to implement, though, because we'd need to change the pipeline significantly and experimentally tune the pre-training/fine-tuning parameters, and we're already struggling with this for back-translations. So if we can get away with filtering and conversion, that would be great.
> I wonder if just filtering out Cantonese and converting Mandarin between Traditional and Simplified scripts will work for training separate models. I see fastText supports the `yue` language code (Cantonese) and `zh`. I would assume it filters out Cantonese then, but I'm not sure; I need to double-check that.
There's also the problem that language identifiers tend to be very bad at distinguishing between Chinese variants, especially at the level of a sentence or a few words.
See also this comment by Jaume in https://github.com/mozilla/firefox-translations-training/issues/45#issuecomment-1036191497:

> Chinese script comes in traditional and simplified varieties. Most big translation vendors support both. Converting traditional to simplified (and vice versa) can be easily achieved through hanzi-conv (https://pypi.org/project/hanziconv/0.3/). There might be a very small information loss when converting simplified to traditional, but it should be fine in 99.9% of cases. Some datasets, such as TED talks, come in traditional, so they should be converted before use.