texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

How can I obtain the Wiki-ss dataset? #136

Closed. wuzhi19931128 closed this issue 6 days ago

wuzhi19931128 commented 1 week ago

How can I obtain the Wiki-ss dataset mentioned in the paper https://arxiv.org/pdf/2406.11251?

MXueguang commented 1 week ago

Hi @wuzhi19931128, thanks for your interest. I am uploading it to Hugging Face; it should be ready today.

MXueguang commented 1 week ago

@wuzhi19931128

The file is above the 300 GB limit of Hugging Face, so I am currently hosting it here: https://storage.googleapis.com/tevatron-vision/wiki-ss-hf-data.tar

Please download it from this link with wget, extract the tar archive, and then load the data with datasets.load_from_disk.
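
For reference, here is a minimal sketch of those steps in Python. It makes several assumptions that are not confirmed in this thread: that the archive unpacks into a single top-level directory, that this directory is what `datasets.load_from_disk` expects, and that the `datasets` package is installed. The helper names (`download`, `extract_archive`, `load_wiki_ss`, `main`) and local paths are hypothetical. The download helper uses `urllib` only as a stand-in for wget; for a file this large, wget is likely the better choice because it can resume interrupted transfers.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the comment above.
URL = "https://storage.googleapis.com/tevatron-vision/wiki-ss-hf-data.tar"


def download(url, dest):
    """Fetch url to dest unless dest already exists (no resume support)."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_archive(archive, out_dir):
    """Unpack a tar archive into out_dir and return its top-level entries."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive)) as tar:
        tar.extractall(str(out_dir))
    return sorted(out_dir.iterdir())


def load_wiki_ss(data_dir):
    """Load the extracted data with datasets.load_from_disk."""
    # Imported here so the other helpers work without `datasets` installed.
    from datasets import load_from_disk
    return load_from_disk(str(data_dir))


def main():
    archive = download(URL, "wiki-ss-hf-data.tar")      # hypothetical local name
    entries = extract_archive(archive, "wiki-ss-data")  # hypothetical output dir
    # Assumption: the first top-level entry is the saved dataset directory.
    dataset = load_wiki_ss(entries[0])
    print(dataset)
```

Calling `main()` runs all three steps. Since the archive is over 300 GB, make sure there is enough free disk space for both the tar file and its extracted contents, and check the extracted directory layout before passing a path to `load_wiki_ss`.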

wuzhi19931128 commented 6 days ago

grateful!