texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

question about msmarco passage ranking dataset #66

Open lboesen opened 1 year ago

lboesen commented 1 year ago

Hi :)

Regarding the MS MARCO passage dataset that gets downloaded from Hugging Face (https://huggingface.co/datasets/Tevatron/msmarco-passage/tree/main):

How was this dataset created? It doesn't match any of the datasets on the original Microsoft site (https://microsoft.github.io/msmarco/Datasets.html).

Thanks in advance

Tan-Hexiang commented 1 year ago

I have a similar question. The NQ dataset from here (https://huggingface.co/datasets/Tevatron/wikipedia-nq/tree/main) is not the same as the generally used NQ dataset from the DPR paper (https://arxiv.org/abs/2004.04906).

I found the problem because the Tevatron/wikipedia-nq dev set has only 6,489 queries, while the DPR NQ dev set has 8,757 queries and the original NQ dev set has 7,830 queries.

How was the NQ dataset created? Or which paper does the dataset come from? @MXueguang

MXueguang commented 1 year ago

Hi @Tan-Hexiang, I think I used the code below to filter the train and dev sets.

import json

# Count the examples that survive the filter: at least one positive
# passage and at least 8 hard negatives.
data = json.load(open("biencoder-nq-dev.json"))
count = 0
for example in data:
    if len(example['positive_ctxs']) > 0 and len(example['hard_negative_ctxs']) >= 8:
        count += 1
print(count)

The file biencoder-nq-dev.json is from the original DPR repo; it contains 6.6k questions. https://github.com/facebookresearch/DPR/blob/a31212dc0a54dfa85d8bfa01e1669f149ac832b7/dpr/data/download_data.py#L38

The reason we applied the above filter is that we found having 8 hard negatives in a group sometimes gives better effectiveness in our early experiments.
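
For reference, a minimal end-to-end sketch of such a filter, extending the counting snippet above to actually write out the filtered split (the output file name is illustrative, not the exact script used to build the release):

import json

# Hypothetical end-to-end version of the counting snippet: keep only the
# examples with at least one positive passage and at least 8 hard
# negatives, then write the filtered split out. The output file name is
# illustrative.
with open("biencoder-nq-dev.json") as f:
    data = json.load(f)

filtered = [
    ex for ex in data
    if len(ex["positive_ctxs"]) > 0 and len(ex["hard_negative_ctxs"]) >= 8
]
print(f"kept {len(filtered)} of {len(data)} examples")  # 6489 for NQ dev

with open("biencoder-nq-dev.filtered.json", "w") as f:
    json.dump(filtered, f)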

Tan-Hexiang commented 1 year ago

@MXueguang Thanks for your reply! The description of biencoder-nq-dev.json points out that it can only be used for retriever train-time validation.

[screenshot: description of biencoder-nq-dev.json in the DPR download script]

Instead, when validating the retrieval results, perhaps the nq-dev.qa.csv file should be used.

[screenshot: description of nq-dev.qa.csv in the DPR download script]

I am confused about which file to use when validating the retrieval results in example_dpr.md. As far as I know, DPR uses nq-dev.qa.csv, which has 8,757 queries, for validation. So for a fair comparison, I think we should also use the same file as DPR instead of the file with 6.6k questions.
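
For context, DPR-style top-k accuracy over such a qa.csv file can be computed roughly as below. This is a sketch assuming the DPR format (tab-separated rows of question and a Python-literal list of answer strings); the answer matching is simplified to case-insensitive substring tests, whereas DPR normalizes and tokenizes text before matching.

import ast
import csv

def load_qa(path):
    # Each row: question \t "['answer 1', 'answer 2', ...]" (assumed DPR format).
    with open(path) as f:
        return [(q, ast.literal_eval(a)) for q, a in csv.reader(f, delimiter="\t")]

def top_k_accuracy(qa, retrieved_texts, k=20):
    # retrieved_texts[i] is the ranked list of passage texts for question i.
    hits = sum(
        any(ans.lower() in p.lower() for ans in answers for p in passages[:k])
        for (_, answers), passages in zip(qa, retrieved_texts)
    )
    return hits / len(qa)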

Tan-Hexiang commented 1 year ago

Concretely, which dev file does the top-k accuracy below correspond to: nq-dev.qa.csv with 8,757 questions, or the filtered biencoder-nq-dev.json with 6,489 questions?

[screenshot: top-k retrieval accuracy table]

MXueguang commented 1 year ago

Following the original DPR work, all the evaluation was on the test set.

python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_nq \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --dataset_name Tevatron/wikipedia-nq/test \
  --encoded_save_path query_emb.pkl \
  --encode_is_qry

Here we are encoding the test-set questions for evaluation.
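
For completeness, a rough sketch of the retrieval step that would follow. It assumes the encode driver pickles an (embeddings, ids) tuple per file, and that the passages were encoded to an assumed corpus_emb.pkl; Tevatron also ships a faiss retriever module for this step.

import pickle

import faiss
import numpy as np

# Sketch of brute-force retrieval over the encoded representations.
# query_emb.pkl comes from the command above; corpus_emb.pkl is an assumed
# name for the encoded passages. The (embeddings, ids) pickle layout is an
# assumption about the encode driver's output.
query_reps, query_ids = pickle.load(open("query_emb.pkl", "rb"))
passage_reps, passage_ids = pickle.load(open("corpus_emb.pkl", "rb"))

index = faiss.IndexFlatIP(passage_reps.shape[1])  # dot-product search, as in DPR
index.add(np.asarray(passage_reps, dtype="float32"))

scores, neighbors = index.search(np.asarray(query_reps, dtype="float32"), 100)
run = {qid: [passage_ids[i] for i in row] for qid, row in zip(query_ids, neighbors)}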

Y1Jia commented 1 year ago

I have the same question about how the Tevatron/msmarco-passage dataset was created @MXueguang

MXueguang commented 1 year ago

Hi @Y1Jia, the Tevatron/msmarco-passage data was created following https://github.com/texttron/tevatron/tree/main/examples/coCondenser-marco#get-data

Y1Jia commented 1 year ago

Thank you for getting back to me so fast!