Code for a new machine translation benchmark, Tatoeba

Hi, I'm proposing to integrate the Tatoeba machine translation dataset into sotabench-eval. I have included code for running the tests, modeled after WMT, and for downloading and configuring the data. I'm not 100% sure how the caching is supposed to work at the moment; I'll come back to that.

Currently you can run an evaluation like this:

import sotabencheval
from sotabencheval.machine_translation import TatoebaEvaluator, TatoebaDataset

# Download and unpack the test data under the directory "tatoeba".
# This only needs to be done if the data isn't already present.
sotabencheval.machine_translation.tatoeba.fetch_and_configure_data("tatoeba")
evaluator = TatoebaEvaluator(
    dataset=TatoebaDataset.v1, source_lang="eng", target_lang="deu",
    local_root="tatoeba", model_name="Some model", paper_arxiv_id="Some id",
)

# Add model translations, keyed by sentence id.
evaluator.add({1: "Tom mag die italienische Küche.", 2: "Hier wirst du viel lernen."})
print(evaluator.get_results(ignore_missing=True))
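
Since the download only needs to happen when the data isn't already present, that check could be wrapped in a small guard. The following is a minimal sketch, not part of this PR: `ensure_data` is a hypothetical helper, and it assumes that a non-empty directory means the data was already fetched and configured.

```python
import os
from typing import Callable


def ensure_data(local_root: str, fetch: Callable[[str], None]) -> bool:
    """Call fetch(local_root) only if local_root is missing or empty.

    Hypothetical helper: returns True if fetch was called,
    False if the existing directory was reused.
    """
    # Assumption: a non-empty directory means the data is already in place.
    if os.path.isdir(local_root) and os.listdir(local_root):
        return False
    fetch(local_root)
    return True
```

It would be called as `ensure_data("tatoeba", sotabencheval.machine_translation.tatoeba.fetch_and_configure_data)`. Note that the non-empty check is only a heuristic: an interrupted download would leave a partially filled directory that passes it.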

You should be able to merge this without breaking anything, but please point me towards what else needs to be done...
