mtdata sources include datasets with BCP-47 language tags in the format xxx_Yyyy_ZZ, where the script (Yyyy) and region (ZZ) are optional. The compressed files downloaded from these datasets include the full tag in the extension, e.g. downloading
- mtdata_Statmt-ccaligned-1-eng-zho_CN
Results in:
Statmt-ccaligned-1-eng-zho_CN.eng.gz and Statmt-ccaligned-1-eng-zho_CN.zho_CN.gz
Note the extension .zho_CN.gz.
The current mtdata importer assumes the dataset uses plain ISO 639-3 codes and does not check for a script or region in the output file name, which results in the following error:
mv .../Statmt-ccaligned-1-eng-zho_CN.zho.gz .../mtdata_Statmt-ccaligned-1-eng-zho_CN.zh.gz
mv: cannot stat '.../train-parts/Statmt-ccaligned-1-eng-zho_CN.zho.gz': No such file or directory
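One possible direction for a fix, as a rough sketch only: take the file extension from the full tag in the dataset ID instead of assuming a bare three-letter code. The helper names below (`split_tag`, `dataset_tags`, `part_file_name`) are hypothetical and not part of mtdata or the importer. The sketch also assumes the dataset ID always ends in `-<src tag>-<trg tag>`, as in the example above.

```python
import re
from typing import Dict, Optional, Tuple

# Tag layout described above: xxx_Yyyy_ZZ, with script (Yyyy) and region (ZZ) optional.
# Assumption: language is 2-3 lowercase letters, region is 2 uppercase letters or 3 digits.
TAG_RE = re.compile(
    r"^(?P<lang>[a-z]{2,3})"
    r"(?:_(?P<script>[A-Z][a-z]{3}))?"
    r"(?:_(?P<region>[A-Z]{2}|[0-9]{3}))?$"
)


def split_tag(tag: str) -> Dict[str, Optional[str]]:
    """Split a tag such as 'zho_CN' into its language, script and region parts."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognised language tag: {tag!r}")
    return match.groupdict()


def dataset_tags(dataset_id: str) -> Tuple[str, str]:
    """Return the two language tags from the end of a dataset ID.

    Assumption: the ID ends in '-<src tag>-<trg tag>'.
    """
    src, trg = dataset_id.split("-")[-2:]
    return src, trg


def part_file_name(dataset_id: str, tag: str) -> str:
    """Name of the downloaded file for one side, keeping the full tag."""
    return f"{dataset_id}.{tag}.gz"


dataset_id = "Statmt-ccaligned-1-eng-zho_CN"
src_tag, trg_tag = dataset_tags(dataset_id)
print(part_file_name(dataset_id, src_tag))
print(part_file_name(dataset_id, trg_tag))
print(split_tag(trg_tag)["lang"])  # bare language code, e.g. for mapping to the output name
```

With something like this, the `mv` source path would be built from the full tag (`zho_CN`), and only the language part (`zho`) would be used when mapping to the short code in the destination name.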
I think this is still valid. I'm guessing our task will fail in Taskcluster if and when it comes up. We only need to fix it when a dataset triggers it though.