mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Add BCP 47 code support in mtdata importer. #76

Open kopachef opened 2 years ago

kopachef commented 2 years ago

mtdata sources include datasets tagged with BCP 47 language codes in the format xxx_Yyyy_ZZ, where the script (Yyyy) and region (ZZ) parts are optional. Compressed downloads of these datasets include the full tag in the file extension. For example, downloading

- mtdata_Statmt-ccaligned-1-eng-zho_CN

results in Statmt-ccaligned-1-eng-zho_CN.eng.gz and Statmt-ccaligned-1-eng-zho_CN.zho_CN.gz.
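For illustration, the naming pattern described above can be sketched in Python. This is only a sketch, not the importer's actual code: `expected_files` is a hypothetical helper, and it assumes the two language tags are always the last two hyphen-separated fields of the dataset name, which holds for the example here but is not confirmed for every mtdata dataset.

```python
def expected_files(dataset):
    """Return the two file names mtdata is expected to write for a dataset.

    Hypothetical helper. Assumes the last two hyphen-separated fields of the
    dataset name are the source and target language tags.
    """
    prefix = "mtdata_"
    # Drop the importer prefix if present, keeping the mtdata dataset ID.
    name = dataset[len(prefix):] if dataset.startswith(prefix) else dataset
    # Split from the right so hyphens earlier in the name are left alone.
    _, src_tag, trg_tag = name.rsplit("-", 2)
    return "{}.{}.gz".format(name, src_tag), "{}.{}.gz".format(name, trg_tag)


print(expected_files("mtdata_Statmt-ccaligned-1-eng-zho_CN"))
```

The point of the sketch is that the extension carries the whole tag (zho_CN), not just the language part (zho).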

The current mtdata importer assumes every dataset uses bare ISO 639-3 codes and does not account for a script or region suffix in the output file name, which results in the following error:

mv .../Statmt-ccaligned-1-eng-zho_CN.zho.gz .../mtdata_Statmt-ccaligned-1-eng-zho_CN.zh.gz
mv: cannot stat '.../train-parts/Statmt-ccaligned-1-eng-zho_CN.zho.gz': No such file or directory
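One possible way to avoid the failed move is to look for the file under the full tag first and only fall back to the bare language code. The Python sketch below is an assumption about how that could work, not the importer's code: `split_tag` and `move_part` are hypothetical names, and the regular expression only covers the xxx_Yyyy_ZZ shape described in this issue, not the full BCP 47 grammar.

```python
import re
import shutil
from pathlib import Path

# Matches xxx, xxx_Yyyy, xxx_ZZ, or xxx_Yyyy_ZZ (the shape described in this
# issue only, not the complete BCP 47 grammar).
TAG_RE = re.compile(
    r"^(?P<lang>[a-z]{2,3})(?:_(?P<script>[A-Z][a-z]{3}))?(?:_(?P<region>[A-Z]{2}))?$"
)


def split_tag(tag):
    """Split a tag into (language, script, region); missing parts are None."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError("unrecognized language tag: {!r}".format(tag))
    return match.group("lang"), match.group("script"), match.group("region")


def move_part(parts_dir, dataset, tag, out_path):
    """Move the file written for `tag` to `out_path` (hypothetical helper)."""
    parts_dir = Path(parts_dir)
    # Prefer the file named with the full tag, e.g. <dataset>.zho_CN.gz.
    src = parts_dir / "{}.{}.gz".format(dataset, tag)
    if not src.exists():
        # Fall back to the bare language code, e.g. <dataset>.zho.gz.
        lang, _, _ = split_tag(tag)
        src = parts_dir / "{}.{}.gz".format(dataset, lang)
    shutil.move(str(src), str(out_path))
    return Path(out_path)
```

With this approach the caller passes the tag exactly as it appears in the dataset name (zho_CN here) and chooses the output name itself, so the mapping to the pipeline's own language code stays outside the helper.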

XapaJIaMnu commented 2 years ago

I was just about to open the same bug report. +1

gregtatum commented 6 months ago

I think this is still valid. I'm guessing our task will fail in Taskcluster if and when it comes up. We only need to fix it when a dataset triggers it though.