TydiQA dataset is mixed and is not split per language

dorost1234 commented 3 years ago

Hi @lhoestq, currently TydiQA is mixed, and users can only access the whole training set for all languages: https://www.tensorflow.org/datasets/catalog/tydi_qa

To use this dataset, one needs to train and evaluate on each language separately, and having the languages mixed makes that hard. It would be much more convenient for users to have the data split per language, and I would appreciate your help with this.

Meanwhile, until this is split per language, I would greatly appreciate it if you could tell me how to preprocess the data to get it per language. Thanks a lot.

lhoestq commented 3 years ago

You can filter the languages this way:

tydiqa_en = tydiqa_dataset.filter(lambda x: x["language"] == "english")
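
If every language is needed, calling filter once per language scans the dataset repeatedly. A minimal sketch of a single-pass alternative follows, assuming the examples carry a "language" column as in the filter call above. The helper name indices_by_language and the toy values are made up for illustration and are not part of the datasets library.

```python
from collections import defaultdict


def indices_by_language(languages):
    """Map each language value to the list of row indices that carry it.

    `languages` is any iterable of language strings, for example the
    "language" column of the dataset (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for index, language in enumerate(languages):
        groups[language].append(index)
    return dict(groups)


# Toy column standing in for tydiqa_dataset["language"].
toy_column = ["english", "finnish", "english"]
groups = indices_by_language(toy_column)
print(groups)  # {'english': [0, 2], 'finnish': [1]}

# With a real Dataset object, each group could then be materialised via
# Dataset.select, which takes a list of row indices:
# groups = indices_by_language(tydiqa_dataset["language"])
# per_language = {lang: tydiqa_dataset.select(idx) for lang, idx in groups.items()}
```

This reads the language column once and builds all per-language subsets from the collected indices, which may be preferable when looping over many languages.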

Otherwise, maybe we can have one configuration per language? What do you think of this, for example?

load_dataset("tydiqa", "primary_task.en")
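
Until such a configuration exists, the proposed naming could be emulated on the user side. The sketch below is only an illustration: the "task.code" name format is the proposal from this comment, not an existing configuration, and the helper parse_config_name and the code-to-name table are assumptions of this example.

```python
# Two-letter codes mapped to the language names assumed to appear in the
# "language" column. This table is an assumption of this sketch.
LANGUAGE_NAMES = {
    "ar": "arabic", "bn": "bengali", "en": "english", "fi": "finnish",
    "id": "indonesian", "ja": "japanese", "ko": "korean", "ru": "russian",
    "sw": "swahili", "te": "telugu", "th": "thai",
}


def parse_config_name(name):
    """Split a name like "primary_task.en" into (task, language name).

    Returns (task, None) when no language suffix is present.
    """
    task, _, code = name.partition(".")
    if not code:
        return task, None
    if code not in LANGUAGE_NAMES:
        raise ValueError(f"unknown language code: {code!r}")
    return task, LANGUAGE_NAMES[code]


print(parse_config_name("primary_task.en"))  # ('primary_task', 'english')

# Possible use with the datasets library (not executed here):
# task, language = parse_config_name("primary_task.en")
# dataset = load_dataset("tydiqa", task, split="train")
# if language is not None:
#     dataset = dataset.filter(lambda x: x["language"] == language)
```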

dorost1234 commented 3 years ago

Hi, thank you very much for the great response. It would be really wonderful to have one configuration per language, since in the majority of cases one needs the dataset per language for cross-lingual evaluation. This would also bring it closer to the TFDS format, which is separated per language: https://www.tensorflow.org/datasets/catalog/tydi_qa. That would be really great to have. Thanks.


dorost1234 commented 3 years ago

@lhoestq I would greatly appreciate any updates on this. Thanks a lot.

huggingface / datasets

TydiQA dataset is mixed and is not split per language #2132