Closed: tuzhucheng closed this issue 2 years ago.
Hi @tuzhucheng,
We use align_wikidata.py to retrieve the corresponding entities and then augment the original NQ training data with the aligned answers, simply by replacing the original English answers with the corresponding entities and appending language tags to the end of the questions. If it helps, I'll add the short script to this repository.
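Roughly, the idea is the following (a minimal sketch only: the field names, file names, and the exact language-tag format here are assumptions for illustration, not the actual align_wikidata.py interface):

```python
# Minimal sketch (assumed field names and file layout, not the exact
# align_wikidata.py interface): augment NQ training data by swapping
# the English answers for their aligned entities in a target language
# and appending a language tag to the end of the question.
import json

def augment_nq_with_aligned_answers(nq_examples, aligned_entities, target_langs):
    """nq_examples: list of {"question": str, "answers": [str]}
    aligned_entities: {english_answer: {lang: entity_label_in_lang}}
    target_langs: e.g. ["ja", "ru", "te"]
    """
    augmented = []
    for ex in nq_examples:
        for lang in target_langs:
            aligned = [
                aligned_entities[a][lang]
                for a in ex["answers"]
                if a in aligned_entities and lang in aligned_entities[a]
            ]
            if not aligned:
                continue  # skip this language if no answer could be aligned
            augmented.append({
                # language tag appended at the end of the question
                "question": f'{ex["question"]} [{lang}]',
                # English answers replaced by the aligned entities
                "answers": aligned,
                "lang": lang,
            })
    return augmented

if __name__ == "__main__":
    with open("nq_train.json") as f:          # assumed input file name
        nq_examples = json.load(f)
    with open("aligned_entities.json") as f:  # assumed output of align_wikidata.py
        aligned_entities = json.load(f)
    augmented = augment_nq_with_aligned_answers(
        nq_examples, aligned_entities, target_langs=["ja", "ru", "te"]
    )
    with open("nq_train_augmented.json", "w") as f:
        json.dump(augmented, f, ensure_ascii=False)
```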
Thanks Akari, that answers my question! There is no need to add the additional script. It also matches the description in Appendix A.2 (Details of the Data Mining Process) of the CORA paper.
For baseline 1 (mDPR_mia_train_data_non_iterative_biencoder_best.cpt), are any passages mined using Wikipedia language links and predicted to be positive/negative by mGEN used to train mDPR, or is the augmentation mentioned above only used for mGEN? In other words, are mia2022_mdpr_train.json and mia_train_adversarial.json the only datasets used to train mDPR_mia_train_data_non_iterative_biencoder_best.cpt?
For baseline 1, we only augmented the data for mGEN. We did not use mia_train_adversarial.json for the baseline 1 mDPR training, but decided to release that data as well, hoping it helps participants develop their systems :)
Thanks for the clarifications! I will close the issue.
Hi, I am just curious if the code for augmenting the Natural Questions data using the Wikipedia language links will be released. Thanks!