mia-workshop / MIA-Shared-Task-2022

An official repository for MIA 2022 (NAACL 2022 Workshop) Shared Task on Cross-lingual Open-Retrieval Question Answering.
MIT License

Code for data augmentation using Wikipedia language links #8

Closed tuzhucheng closed 2 years ago

tuzhucheng commented 2 years ago

Hi, I am just curious if the code for augmenting the Natural Questions data using the Wikipedia language links will be released. Thanks!

AkariAsai commented 2 years ago

Hi @tuzhucheng,

We use align_wikidata.py to retrieve the corresponding entities and then augment the original NQ training data with the aligned answers: we simply replace the original English answers with the corresponding target-language entities and append a language tag to the end of each question. If it helps, I'll add the short script to this repository.
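The augmentation step described above can be sketched roughly as follows. This is a hypothetical illustration, not the released script: the example dictionary shape, the `aligned_entities` mapping (standing in for the output of align_wikidata.py), and the bracketed language-tag format are all assumptions.

```python
# Hypothetical sketch of the augmentation described above (not the official script).
# `aligned_entities` is assumed to map an English answer string to its
# entity name in the target language, as mined via Wikipedia language links.

def augment_example(example, aligned_entities, lang_tag):
    """Replace English answers with aligned target-language entities
    and append a language tag to the question (tag format is assumed)."""
    new_answers = [aligned_entities.get(a, a) for a in example["answers"]]
    return {
        "question": f"{example['question']} [{lang_tag}]",
        "answers": new_answers,
    }

# Usage with made-up data:
nq_example = {
    "question": "who wrote the tale of genji",
    "answers": ["Murasaki Shikibu"],
}
aligned = {"Murasaki Shikibu": "紫式部"}
augmented = augment_example(nq_example, aligned, "ja")
# augmented["answers"] is now ["紫式部"] and the question ends with "[ja]"
```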

tuzhucheng commented 2 years ago

Thanks Akari, that answers my question! There is no need to add the additional script. It also matches the description in Appendix A.2 Details of the Data Mining Process in the CORA paper.

For baseline 1 mDPR_mia_train_data_non_iterative_biencoder_best.cpt, are there any passages mined using Wikipedia language links and predicted to be positive / negative by mGEN used to train mDPR, or is the augmentation mentioned above only used for mGEN? In other words, are mia2022_mdpr_train.json and mia_train_adversarial.json the only datasets used to train mDPR_mia_train_data_non_iterative_biencoder_best.cpt?

AkariAsai commented 2 years ago

For baseline 1, we only augmented data for mGEN. We did not use mia_train_adversarial.json for the baseline 1 mDPR training, but decided to release the data as well, hoping it helps participants develop their systems :)

tuzhucheng commented 2 years ago

Thanks for the clarifications! I will close the issue.