littlewine / ZeCo2

MIT License

Question about the corpus #3

Closed: jinzhuoran closed this issue 1 year ago

jinzhuoran commented 1 year ago

Hi, @littlewine. I read your paper with great interest, and I would like to follow up on your work. Could you give me the download links for the CAR, WAPO and KILT corpora, and explain how collection_mapping is constructed? Thanks for your interesting work!

littlewine commented 1 year ago

Hi, and thanks for the interest. I cannot distribute the collections, since most of them are under a license from TREC. Please contact the TREC CAsT organizers to acquire them.

To construct the collection_mapping, you can use the code here: https://github.com/littlewine/ZeCo2/blob/main/preprocessing/preprocessing_docids_to_int.py. It simply turns string docids into integers so that they are compatible with the ColBERT code, and keeps a mapping. You then have to add the collection mapping path of each collection to paths.py.
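
For readers who just want the gist, here is a minimal sketch of the idea of mapping string docids to integers while keeping a lookup table. It is not the linked script: the function names `build_docid_mapping` and `remap_collection`, the in-memory `(docid, text)` row format, and the example docids are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def build_docid_mapping(docids: Iterable[str]) -> Dict[str, int]:
    """Assign consecutive integers (starting at 0) to string docids.

    Hypothetical helper: the first occurrence of a docid fixes its integer,
    and repeated docids reuse the same integer.
    """
    mapping: Dict[str, int] = {}
    for docid in docids:
        if docid not in mapping:
            mapping[docid] = len(mapping)
    return mapping


def remap_collection(
    rows: Iterable[Tuple[str, str]], mapping: Dict[str, int]
) -> List[Tuple[int, str]]:
    """Replace the string docid of every (docid, text) row with its integer id."""
    return [(mapping[docid], text) for docid, text in rows]


# Made-up rows; real docids depend on the collections you obtain.
rows = [
    ("CAR_abc", "first passage"),
    ("WAPO_123-1", "second passage"),
    ("KILT_42-0", "third passage"),
]

mapping = build_docid_mapping(docid for docid, _ in rows)
int_rows = remap_collection(rows, mapping)

# Inverse mapping, used to translate integer ids in a ranking back to the
# original string docids.
inverse = {int_id: docid for docid, int_id in mapping.items()}
```

Keeping the inverse mapping around is what lets you convert retrieval results back to the original docids for evaluation. How the mapping is stored on disk, and the format expected by paths.py, should be checked against the repository itself.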