Closed: e-tornike closed this issue 2 years ago.
Hi @e-tornike, thank you for the questions! Following the original XOR-TyDi QA dataset, we use the Wikipedia dump from February 2019. For the languages whose February 2019 dumps are no longer available, we use November 2021 data.
You can find the links to the web-archived versions of the Wikipedia dumps in the TyDi QA repository below: tydiqa-source data. For the languages that are not listed there, here are the links:
You can also download preprocessed text data, where we split each article in all target languages into 100-token chunks and concatenate them all. The download link is below:
wget https://nlp.cs.washington.edu/xorqa/cora/models/mia2022_shared_task_all_langs_w100.tsv
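Once downloaded, you can inspect the file with a few lines of Python. This is only a minimal sketch: it assumes the TSV follows the common DPR-style layout (an id, text, title header row followed by one passage per row); please check the actual header of the downloaded file, as the column names may differ.

```python
# Minimal sketch for inspecting the preprocessed passages file.
# Assumption: DPR-style columns (id, text, title); verify against the real header.
import csv

path = "mia2022_shared_task_all_langs_w100.tsv"

with open(path, newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)          # e.g. ["id", "text", "title"]
    print("columns:", header)
    for i, row in enumerate(reader):
        print(row)                 # one ~100-token passage per row
        if i >= 2:                 # look at the first few rows only
            break
```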
We will add this instruction to our README. Thanks for your feedback! Let us know if you have any more questions. The code used to preprocess Wikipedia is at baseline/wikipedia_preprocess.
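For intuition, here is a rough sketch of the 100-token chunking described above. This is an assumption-laden illustration using simple whitespace tokenization; the actual preprocessing in baseline/wikipedia_preprocess may tokenize and join chunks differently, so treat it only as a conceptual example.

```python
# Hedged sketch: split an article into 100-token chunks by whitespace.
# The real code in baseline/wikipedia_preprocess may differ in tokenization
# and in how chunks from all articles are concatenated into one TSV.
def chunk_article(text: str, chunk_size: int = 100) -> list[str]:
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

if __name__ == "__main__":
    article = "word " * 250                     # toy article of 250 tokens
    chunks = chunk_article(article)
    print(len(chunks), [len(c.split()) for c in chunks])  # 3 chunks: 100, 100, 50
```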
We highly recommend using these Wikipedia dumps, but using dumps from different timestamps for some languages is not prohibited.
Thank you for the detailed reply! This answers my question.
No problem! Thank you for raising the question!
Hi there @AkariAsai @shayne-longpre,
is there a specific version of the Wikipedia dump that you would recommend that all participants use (it is not detailed here), or can we pick freely?
Cheers.