Open dorost1234 opened 3 years ago
You can filter the languages this way:
tydiqa_en = tydiqa_dataset.filter(lambda x: x["language"] == "english")
Otherwise, maybe we can have one configuration per language? What do you think of this, for example?
load_dataset("tydiqa", "primary_task.en")
Hi, thank you very much for the great response. It would be really wonderful to have one configuration per language, since in the majority of cases one needs the dataset per language for cross-lingual evaluation. It would also bring it closer to the TFDS format (https://www.tensorflow.org/datasets/catalog/tydi_qa), which is separated per language. That would be really great to have. Thanks.
@lhoestq I would greatly appreciate any updates on this. Thanks a lot.
Hi @lhoestq, currently TyDi QA is mixed, and users can only access the whole training set of all languages: https://www.tensorflow.org/datasets/catalog/tydi_qa
To use this dataset, one needs to train and evaluate on each language separately, and having the languages mixed makes that hard. It would be much more convenient for users to have them split, and I would appreciate your help with this.
Meanwhile, until it is hopefully split per language, I would greatly appreciate it if you could tell me how to preprocess the data to get it per language. Thanks a lot.
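In the meantime, one possible way to get per-language subsets is to group the examples on the "language" field used in the filter snippet earlier in this thread. The following is only a minimal sketch: `split_by_language` is a hypothetical helper, not part of the `datasets` library, and the toy rows (including the `question_text` field) stand in for real examples so the snippet runs without downloading anything.

```python
from collections import defaultdict


def split_by_language(examples, key="language"):
    """Group an iterable of example dicts by the value of their language field.

    Hypothetical helper for illustration; assumes every example has `key`.
    """
    groups = defaultdict(list)
    for example in examples:
        groups[example[key]].append(example)
    return dict(groups)


# Toy rows standing in for TyDi QA examples (made-up content).
rows = [
    {"language": "english", "question_text": "q1"},
    {"language": "finnish", "question_text": "q2"},
    {"language": "english", "question_text": "q3"},
]

per_language = split_by_language(rows)
for lang, items in per_language.items():
    print(lang, len(items))
```

With the real dataset object, the `filter` call shown above, applied once per language name, should give the same kind of split while keeping each subset as a dataset rather than a plain list.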