mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0

[Question] TEDx directionality versus WMT origlang #177

Closed: kpu closed this issue 2 years ago

kpu commented 2 years ago

This works:

sacrebleu -t mtedx/valid -l fr-en --echo src

but this doesn't:

sacrebleu -t mtedx/valid -l fr-en --echo src

The output is:

sacreBLEU: No such language pair 'en-fr'
sacreBLEU: Available language pairs for 'mtedx/valid' are:
sacreBLEU:  > el-en
sacreBLEU:  > es-en
sacreBLEU:  > es-fr
sacreBLEU:  > es-it
sacreBLEU:  > es-pt
sacreBLEU:  > fr-en
sacreBLEU:  > fr-es
sacreBLEU:  > fr-pt
sacreBLEU:  > it-en
sacreBLEU:  > it-es
sacreBLEU:  > pt-en
sacreBLEU:  > pt-es
sacreBLEU:  > ru-en

Is this directional behavior intentional? It seems different from WMT with --origlang, but that may be down to how the test sets are described by organizers.

ozancaglayan commented 2 years ago

Looking at dataset.py, I see that, at least for WMT, both directions are explicitly listed for downloading purposes. For mTEDx, that is not the case. @esalesky @mjpost probably have more info on this.

mjpost commented 2 years ago

@kpu those are the exact same commands AFAICS? I assume you meant en-fr? This seems to be @ozancaglayan’s assumption but we should be sure.

The answer is that this is only tangentially related to the origlang issue. We’ve always required each direction to be explicitly listed because the datasets were not always symmetric when sacrebleu was born (had they been, I probably would have taken the short-sighted shortcut of assuming symmetry).

mTEDx is a dataset whose original audio is not in English, so it is highly directional. You /could/ use it backwards (as you could with any dataset) with all the entailed caveats, but we didn’t list the reverse directions because they weren’t really relevant.
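As a rough illustration of what "explicitly listed" means in practice (a sketch, not sacrebleu's actual code), a lookup along these lines succeeds only for directions that appear in the list. The pair list is copied from the error output earlier in this thread; the helper names `is_listed` and `reverse` are hypothetical.

```python
# Language pairs copied from the 'mtedx/valid' error output in this thread.
MTEDX_VALID_PAIRS = [
    "el-en", "es-en", "es-fr", "es-it", "es-pt",
    "fr-en", "fr-es", "fr-pt", "it-en", "it-es",
    "pt-en", "pt-es", "ru-en",
]


def is_listed(pair, listed=MTEDX_VALID_PAIRS):
    """Return True only if this exact direction is listed (hypothetical helper)."""
    return pair in listed


def reverse(pair):
    """Swap source and target: 'fr-en' becomes 'en-fr' (hypothetical helper)."""
    src, tgt = pair.split("-")
    return f"{tgt}-{src}"


# Print each pair, whether it is listed, and whether its reverse is listed.
for pair in ("fr-en", "en-fr", "es-fr"):
    print(pair, is_listed(pair), is_listed(reverse(pair)))
```

Nothing here infers a reverse direction from a listed one, which mirrors the behavior described above: a pair is available only when someone has listed it.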

esalesky commented 2 years ago

for mTEDx it's as @mjpost said: the dataset is directional, as each language pair contains talks with audio in the source language which were translated. this means, for example, that the es-fr data is not the same set of talks as the fr-es data. the validation and testing data is aligned across source languages so there would be no contamination with auxiliary ASR tasks (fr-es and fr-en are the same talks, as are es-en and es-fr, but the fr-* talks are not the same as the es-* talks). there is no English source audio.

if you are using the dataset for MT without speech, you could of course use the text data for translation for en-fr, or combine for example the es-fr and fr-es data, but it's not safe to assume both directions of a pair are the same and both are present.
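to make the grouping concrete, here is a small sketch that groups the listed pairs by source language. the pair list is copied from the error output earlier in this thread, and `targets_by_source` is a hypothetical helper, not part of sacrebleu. the assumption that all pairs sharing a source language cover the same talks is only stated above for fr and es, so treat it as unverified for the other sources.

```python
from collections import defaultdict

# Language pairs copied from the 'mtedx/valid' error output in this thread.
MTEDX_VALID_PAIRS = [
    "el-en", "es-en", "es-fr", "es-it", "es-pt",
    "fr-en", "fr-es", "fr-pt", "it-en", "it-es",
    "pt-en", "pt-es", "ru-en",
]


def targets_by_source(pairs):
    """Map each source language to the target languages listed for it."""
    grouped = defaultdict(list)
    for pair in pairs:
        src, tgt = pair.split("-")
        grouped[src].append(tgt)
    return dict(grouped)


groups = targets_by_source(MTEDX_VALID_PAIRS)
for src, tgts in sorted(groups.items()):
    print(src, "->", ", ".join(tgts))
```

English never shows up as a key in `groups`, which matches the point that there is no English source audio.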

kpu commented 2 years ago

You're right, I meant

sacrebleu -t mtedx/valid -l en-fr --echo src

doesn't work. Though that seems intentional from what I read above, which answers the question.