ielab / CharacterBERT-DR

The official repository for 'CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos', SIGIR 2022
Apache License 2.0

data/marco_dev #2

Closed by whale-z 1 year ago

whale-z commented 1 year ago

Hi, I was wondering how you obtained the 6,980 queries in the marco_dev folder from the MS MARCO dev set. Best wishes!

ArvinZhuang commented 1 year ago

The original queries were downloaded from here (dev small, in queries.tar.gz of the passage ranking dataset).

The typo queries and the spell-checker-corrected queries were generated using the scripts in https://github.com/ielab/CharacterBERT-DR/tree/main/data

whale-z commented 1 year ago

Thank you very much for your reply. But as far as I know, the queries.dev.tsv file in queries.tar.gz contains 101,093 queries. How did you filter the 6,980 queries out of these? Was it just random sampling?

ArvinZhuang commented 1 year ago

Hi,

I'm sorry, the eval small queries should be in collectionandqueries.tar.gz. This is a subset of the eval queries; it is used for the leaderboard evaluation and therefore in many research papers.
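For anyone who wants to check this locally, here is a minimal sketch (not the script used for this repository) that loads a two-column query TSV and counts its entries. The file name `queries.dev.small.tsv` and the `collectionandqueries/` directory are assumptions about what the extracted archive contains, and `load_queries` and `shared_ids` are hypothetical helper names.

```python
import csv
from pathlib import Path


def load_queries(tsv_path):
    """Read a tab-separated file of (query id, query text) rows into a dict."""
    queries = {}
    with open(tsv_path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside queries as literal characters.
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            queries[row[0]] = row[1]
    return queries


def shared_ids(small, full):
    """Return the query ids of `small` that also occur in `full`."""
    return set(small) & set(full)


if __name__ == "__main__":
    # Assumed paths after extracting the two archives; adjust to your layout.
    small_path = Path("collectionandqueries/queries.dev.small.tsv")
    full_path = Path("queries/queries.dev.tsv")
    if small_path.exists():
        small = load_queries(small_path)
        print(f"small set: {len(small)} queries")
        if full_path.exists():
            full = load_queries(full_path)
            print(f"full set: {len(full)} queries")
            print(f"shared ids: {len(shared_ids(small, full))}")
```

If the small file is the one described above, the first count should match the 6,980 queries in marco_dev, and every one of its ids should also appear in the full dev file.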

whale-z commented 1 year ago

This is exactly what I want. Thank you very much for your help.