Closed kpu closed 2 years ago
Looking at `dataset.py`, I see that, at least for WMT, both directions are explicitly listed for downloading purposes. For mTEDx, that is not the case. @esalesky and @mjpost probably have more info on this.
@kpu those are the exact same commands AFAICS? I assume you meant `en-fr`? This seems to be @ozancaglayan's assumption but we should be sure.
The answer is that this is only tangentially related to the origlang issue. We’ve always required each direction to be explicitly listed because the datasets were not always symmetric when sacrebleu was born (had they been, I probably would have taken the short-sighted shortcut of assuming symmetricity).
mTEDx is a dataset whose original audio is not in English, so it is highly directional. You /could/ use it backwards (as you could with any dataset) with all the entailed caveats, but we didn't list it because it wasn't really relevant.
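To illustrate the explicit-listing behavior described above, here is a minimal sketch. It does not reproduce sacrebleu's actual `dataset.py`: the `LISTED_PAIRS` table and the `is_listed` and `reverse` helpers are hypothetical names, and the entries are only the pairs mentioned in this thread.

```python
# Hypothetical registry: every direction must be listed explicitly.
# Illustrative entries only (pairs named in this thread), not a copy of dataset.py.
LISTED_PAIRS = {
    "mtedx/valid": {"fr-en", "fr-es", "es-en", "es-fr"},
}


def is_listed(testset: str, langpair: str) -> bool:
    """Return True only if this exact direction is listed for the test set."""
    return langpair in LISTED_PAIRS.get(testset, set())


def reverse(langpair: str) -> str:
    """Swap source and target, e.g. 'fr-en' becomes 'en-fr'."""
    src, tgt = langpair.split("-")
    return f"{tgt}-{src}"


if __name__ == "__main__":
    # No symmetry is assumed: listing fr-en says nothing about en-fr.
    for pair in ("fr-en", reverse("fr-en")):
        print(pair, is_listed("mtedx/valid", pair))
```

Under this design a reversed pair is simply absent unless someone adds it, which matches a lookup for `en-fr` failing while `fr-en` succeeds.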
for mTEDx it's as @mjpost said: the dataset is directional, as each language pair contains talks with audio in the source language which were translated. this means, for example, that the `es-fr` data is not the same set of talks as the `fr-es` data: the validation and testing data is aligned across source languages so there would be no contamination with auxiliary ASR tasks (`fr-es` and `fr-en` are the same talks, as are `es-en` and `es-fr`, but these are not the same talks as each other). there is no English source audio.
if you are using the dataset for MT without speech, you could of course use the text data for translation for `en-fr`, or combine for example the `es-fr` and `fr-es` data, but it's not safe to assume both directions of a pair are the same and both are present.
You're right, I meant that `sacrebleu -t mtedx/valid -l en-fr --echo src` doesn't work. That seems intentional from what I read above, though, which answers the question.
This works:
but this doesn't:
The output is:
Is this directional behavior intentional? It seems different from WMT with `--origlang`, but that may be down to how the test sets are described by organizers.