mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0

Discrepancy in docstrings in TER `normalized` #230

Closed (BramVanroy closed this issue 1 year ago)

BramVanroy commented 1 year ago

In TER, the `normalized` argument has the following docstring:

https://github.com/mjpost/sacrebleu/blob/4f4124642c4eb0b7120e50119c669f0570a326a7/sacrebleu/metrics/ter.py#L78

This is passed to the TER tokenizer. However, there it has a different meaning altogether:

https://github.com/mjpost/sacrebleu/blob/4f4124642c4eb0b7120e50119c669f0570a326a7/sacrebleu/tokenizers/tokenizer_ter.py#L129

It seems to me that the description in TER() is incorrect. If you can confirm that, I can do a PR to change the docstring.
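To make the concern concrete, here is a minimal sketch of the pattern in question: a metric class that forwards a flag unchanged to its tokenizer, so the flag's real meaning is whatever the tokenizer does with it. `SketchTokenizer`, `SketchMetric`, and the punctuation rule are hypothetical stand-ins for illustration only; they are not sacrebleu's classes and do not reproduce its actual normalization.

```python
import re


class SketchTokenizer:
    """Hypothetical stand-in for a TER-style tokenizer (not sacrebleu's code)."""

    def __init__(self, normalized: bool = False):
        """
        :param normalized: If True, split common punctuation off words
            before whitespace tokenization (illustrative rule only).
        """
        self.normalized = normalized

    def __call__(self, sentence: str) -> str:
        if self.normalized:
            # Assumed rule for this sketch: surround punctuation with spaces.
            sentence = re.sub(r"([.,!?;:])", r" \1 ", sentence)
        # Collapse runs of whitespace into single spaces.
        return " ".join(sentence.split())


class SketchMetric:
    """Hypothetical stand-in for a metric that owns a tokenizer."""

    def __init__(self, normalized: bool = False):
        """
        :param normalized: Forwarded unchanged to SketchTokenizer, so this
            description should match the tokenizer's docstring.
        """
        self.tokenizer = SketchTokenizer(normalized=normalized)

    def preprocess(self, sentence: str) -> str:
        # The metric adds no behaviour of its own for this flag.
        return self.tokenizer(sentence)


raw = SketchMetric(normalized=False).preprocess("Hello, world!")
norm = SketchMetric(normalized=True).preprocess("Hello, world!")
print(raw)
print(norm)
```

Because the metric only passes the flag through, one low-maintenance option is to have the metric's docstring defer to the tokenizer's description rather than restate it in different words.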

ozancaglayan commented 1 year ago

Yes, it seems that the one in the tokenizer makes more sense. It would be good if you could open a PR and elaborate a bit more in the docstring so that it reflects what the argument actually does. (I'm seeing calls to western and asian normalization methods.)

Thanks!

ozancaglayan commented 1 year ago

If you would like to contribute to this issue, maybe you could extend your PR to also update the README so that it addresses the following issue: https://github.com/mjpost/sacrebleu/issues/229