texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

Tevatron/wiki-ss-nq Dataset issue #158

Closed by maxjeblick 3 weeks ago

maxjeblick commented 3 weeks ago

Loading the Tevatron/wiki-ss-nq dataset fails with the following code snippet (the classes come from examples/des/dataset.py):

data_args = DataArguments(
    dataset_name="Tevatron/wiki-ss-nq",
    corpus_name="Tevatron/wiki-ss-corpus",
    train_group_size=2,
    query_max_len=128,
    passage_max_len=4096,
)
dataset = TrainDataset(data_args)
print(dataset[0])

This is the same error shown on the corresponding Hugging Face dataset page:

All the data files must have the same columns, but at some point there are 1 new columns ({'answers'}) and 2 missing columns ({'positive_passages', 'negative_passages'}).

The issue is also mentioned here: https://huggingface.co/datasets/Tevatron/wiki-ss-nq/discussions/2
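
The message describes a schema mismatch: one data file exposes different columns than the others. A minimal sketch of that comparison in plain Python follows. The helper `diff_columns` is hypothetical and not part of the `datasets` API, and the column sets are illustrative; only `answers`, `positive_passages`, and `negative_passages` are taken from the reported error.

```python
def diff_columns(reference, candidate):
    """Return (new, missing) column names of `candidate` relative to `reference`."""
    reference, candidate = set(reference), set(candidate)
    new = candidate - reference        # columns only the candidate file has
    missing = reference - candidate    # columns the candidate file lacks
    return new, missing


# Illustrative column sets (assumption): 'query_id' and 'query' are placeholders.
train_columns = {"query_id", "query", "positive_passages", "negative_passages"}
test_columns = {"query_id", "query", "answers"}

new, missing = diff_columns(train_columns, test_columns)
print(f"{len(new)} new columns ({sorted(new)}) and "
      f"{len(missing)} missing columns ({sorted(missing)})")
```

With these assumed column sets, the diff yields one new column and two missing ones, matching the shape of the error above.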

MXueguang commented 3 weeks ago

Weird... let me take a look today.

maxjeblick commented 3 weeks ago

Thanks for looking into it. In particular,

from datasets import load_dataset
ds = load_dataset("Tevatron/wiki-ss-nq")

fails while creating the test split, apparently because the test split does not contain the positive_passages and negative_passages columns.
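
To narrow down which split is responsible, one option is to list the columns of each split and check them against the columns the training code needs. This is only a sketch under assumptions: `splits_missing_columns` is a hypothetical helper, and the per-split column names are placeholders rather than the dataset's verified schema.

```python
# Columns the training code relies on, as named in the error message.
REQUIRED_COLUMNS = {"positive_passages", "negative_passages"}


def splits_missing_columns(split_columns, required=REQUIRED_COLUMNS):
    """Map each split name to the sorted list of required columns it lacks."""
    report = {}
    for split, columns in split_columns.items():
        lacking = set(required) - set(columns)
        if lacking:
            report[split] = sorted(lacking)
    return report


# Placeholder schema (assumption), shaped like the situation described above.
split_columns = {
    "train": ["query_id", "query", "positive_passages", "negative_passages"],
    "test": ["query_id", "query", "answers"],
}

report = splits_missing_columns(split_columns)
print(report)
```

Only splits that lack a required column appear in the report, so an empty dictionary means every split has the same training columns.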

MXueguang commented 3 weeks ago

Does it work for you now? I tried to fix it.

maxjeblick commented 3 weeks ago

Dataset loading works, thanks for the quick fix!