Closed: e-tornike closed this issue 2 years ago.
Hi @e-tornike, thank you for the questions! Following the original XOR-TyDi QA dataset, we use the Wikipedia dump from February 2019. For the languages whose February 2019 dumps are no longer available, we use November 2021 data.
You can find the links to the web-archived versions of the Wikipedia dumps in the TyDi QA repository below: tydiqa-source data. For the languages that are not listed there, here are the links:
You can also download preprocessed text data, where we split each article in all target languages into 100-token chunks and concatenate them all. The download link is below:
wget https://nlp.cs.washington.edu/xorqa/cora/models/mia2022_shared_task_all_langs_w100.tsv
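Once downloaded, you can inspect the file with a few lines of Python. This is only a minimal sketch: it assumes the TSV follows the common DPR-style layout (an id, text, title header row followed by one passage per row); please check the actual header of the downloaded file, as the column names may differ.

```python
# Minimal sketch for inspecting the preprocessed passages file.
# Assumption: DPR-style columns (id, text, title); verify against the real header.
import csv

path = "mia2022_shared_task_all_langs_w100.tsv"

with open(path, newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)          # e.g. ["id", "text", "title"]
    print("columns:", header)
    for i, row in enumerate(reader):
        print(row)                 # one ~100-token passage per row
        if i >= 2:                 # look at the first few rows only
            break
```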
We will add this instruction to our README. Thanks for your feedback! Let us know if you have any more questions. The code used to preprocess Wikipedia is at baseline/wikipedia_preprocess.
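For intuition, here is a rough sketch of the 100-token chunking described above. This is an assumption-laden illustration using simple whitespace tokenization; the actual preprocessing in baseline/wikipedia_preprocess may tokenize and join chunks differently, so treat it only as a conceptual example.

```python
# Hedged sketch: split an article into 100-token chunks by whitespace.
# The real code in baseline/wikipedia_preprocess may differ in tokenization
# and in how chunks from all articles are concatenated into one TSV.
def chunk_article(text: str, chunk_size: int = 100) -> list[str]:
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

if __name__ == "__main__":
    article = "word " * 250                     # toy article of 250 tokens
    chunks = chunk_article(article)
    print(len(chunks), [len(c.split()) for c in chunks])  # 3 chunks: 100, 100, 50
```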
We highly recommend using these Wikipedia dumps, but using dumps from different timestamps for some languages is not prohibited.
Thank you for the detailed reply! This answers my question.
No problem! Thank you for raising the question!
Hi there @AkariAsai @shayne-longpre,
is there a specific version of the Wikipedia dump that you would recommend that all participants use (it is not detailed here), or can we pick freely?
Cheers.