Support for multilingual generative metrics

hynky1999 commented 2 months ago

What does this implement/fix? Explain your changes.

This PR adds two "new" metrics for generative evaluation:
- multilingual_quasi_exact_match_metric (the normalizer is similar to the HELM one, but supports multiple languages. It's mostly inspired by https://github.com/facebookresearch/MLQA/blob/main/mlqa_evaluation_v1.py#L47)
- multilingual_quasi_f1_score_metric
It adds sentence and word tokenizers to the library. I copied them directly from datatrove, where we added support recently. I don't want to depend on datatrove, hence the copy. The definitions may diverge over time, but that's OK and we can sync them later.
As with the tokenizers, I also took the language definitions from datatrove.
Minor typing upgrades
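
For readers unfamiliar with MLQA-style scoring, here is a rough sketch of what a language-aware quasi exact match and quasi F1 could look like. It is not the code in this PR: the names `normalize_answer`, `quasi_exact_match`, `quasi_f1` and the `ARTICLES` table are hypothetical, and it splits on whitespace instead of using the per-language word tokenizers described above, so it would not handle languages written without spaces.

```python
import string
import unicodedata
from collections import Counter

# Hypothetical, deliberately tiny article table; a real one would cover more languages.
ARTICLES = {
    "en": {"a", "an", "the"},
    "es": {"el", "la", "los", "las", "un", "una", "unos", "unas"},
}


def _is_punct(ch: str) -> bool:
    # Unicode punctuation categories start with "P"; also drop ASCII symbols.
    return unicodedata.category(ch).startswith("P") or ch in string.punctuation


def normalize_answer(text: str, lang: str) -> str:
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if not _is_punct(ch))
    # Assumption: whitespace splitting stands in for a real word tokenizer.
    tokens = [t for t in text.split() if t not in ARTICLES.get(lang, set())]
    return " ".join(tokens)


def quasi_exact_match(pred: str, gold: str, lang: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(pred, lang) == normalize_answer(gold, lang))


def quasi_f1(pred: str, gold: str, lang: str) -> float:
    """Token-level F1 over the normalized prediction and reference."""
    pred_tokens = normalize_answer(pred, lang).split()
    gold_tokens = normalize_answer(gold, lang).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

With this sketch, `quasi_exact_match("The Eiffel Tower!", "eiffel tower", "en")` is a match because case, punctuation and the article are all normalized away, while a partially overlapping answer gets partial credit from `quasi_f1`.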

hynky1999 commented 2 months ago

Great! Do you have a task example where this is used?

Yes, we are using this metric for multilingual generative tasks. I plan to add them in bulk once this PR is merged.

hynky1999 commented 2 months ago

Don't we expect the model tokenizer and evaluation tokenizer to behave similarly though?

No, because, as mentioned, the tokenizers serve different purposes.