lboesen opened 1 year ago
I have a similar question. The NQ dataset from https://huggingface.co/datasets/Tevatron/wikipedia-nq/tree/main is not the same as the commonly used NQ dataset from the DPR paper (https://arxiv.org/abs/2004.04906).
I found the problem because the Tevatron/wikipedia-nq dev set has only 6,489 queries, while the DPR NQ dev set has 8,757 queries and the original NQ dev set has 7,830 queries.
How was the NQ dataset created? Or which paper does the dataset come from? @MXueguang
Hi @Tan-Hexiang, I think I used the code below while filtering the train and dev set.

```python
import json

# Count dev examples that have at least one positive passage
# and at least 8 hard negatives
data = json.load(open("biencoder-nq-dev.json"))
count = 0
for example in data:
    if len(example['positive_ctxs']) > 0 and len(example['hard_negative_ctxs']) >= 8:
        count += 1
print(count)
```

The file biencoder-nq-dev.json is from the original DPR repo and contains 6.6k questions:
https://github.com/facebookresearch/DPR/blob/a31212dc0a54dfa85d8bfa01e1669f149ac832b7/dpr/data/download_data.py#L38
The reason we applied the above filter is that we found having 8 hard negatives in a group sometimes gave better effectiveness in our early experiments.
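To materialize the filtered subset rather than just count it, the same condition can be factored into a small helper. This is a sketch, not the exact script that produced the dataset; it only assumes each JSON entry has `positive_ctxs` and `hard_negative_ctxs` lists, as in the counting loop above.

```python
# Keep only examples with >= 1 positive and >= min_hard_negatives hard negatives
def filter_examples(examples, min_hard_negatives=8):
    return [
        ex for ex in examples
        if len(ex['positive_ctxs']) > 0
        and len(ex['hard_negative_ctxs']) >= min_hard_negatives
    ]

# Toy usage on in-memory data shaped like biencoder-nq-dev.json entries
toy = [
    {'positive_ctxs': [{'text': 'p'}], 'hard_negative_ctxs': [{'text': 'n'}] * 8},
    {'positive_ctxs': [],              'hard_negative_ctxs': [{'text': 'n'}] * 10},
    {'positive_ctxs': [{'text': 'p'}], 'hard_negative_ctxs': [{'text': 'n'}] * 3},
]
kept = filter_examples(toy)
print(len(kept))  # -> 1 (only the first example passes both conditions)
```

On the real file, `json.load(open("biencoder-nq-dev.json"))` would feed this helper and the survivors could be dumped back out as the filtered dev set.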
@MXueguang Thanks for your reply!
The description of biencoder-nq-dev.json points out that it can only be used for retriever train-time validation. Instead, when validating the retrieval results, maybe the nq-dev.qa.csv file should be used.
I am confused about which file to use when validating the retrieval results in example_dpr.md. As far as I know, DPR uses the nq-dev.qa.csv file that has 8,757 queries for validation. So for a fair comparison, I think we should also use the same file as DPR instead of a file with 6.6k questions.
Concretely, which dev file does the top-k accuracy below correspond to? nq-dev.qa.csv with 8,757 questions, or the filtered biencoder-nq-dev.json with 6,489 questions?
Following the original DPR work, all the evaluation was on the test set.
```shell
--output_dir=temp \
--model_name_or_path model_nq \
--fp16 \
--per_device_eval_batch_size 156 \
--dataset_name Tevatron/wikipedia-nq/test \
--encoded_save_path query_emb.pkl \
--encode_is_qry
```
Here we are encoding the test-set questions for evaluation.
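Once the encoder finishes, the saved `query_emb.pkl` can be sanity-checked with a short script. This is a sketch under the assumption that the pickle stores an (embedding matrix, query-id list) tuple, which is how Tevatron's encode driver saves its output; the demo file name below is illustrative so the snippet is self-contained.

```python
import pickle
import numpy as np

# Load the encoded queries and report how many were written and their dimension.
# Assumes the pickle layout is (embeddings, query_ids); adjust if it differs.
def inspect_embeddings(path):
    with open(path, "rb") as f:
        reps, lookup = pickle.load(f)
    reps = np.asarray(reps)
    print(f"{reps.shape[0]} queries, dim {reps.shape[1]}")
    return reps, lookup

# Demo: write a synthetic file with 3 fake 768-dim query embeddings, then load it
with open("query_emb.demo.pkl", "wb") as f:
    pickle.dump((np.zeros((3, 768), dtype=np.float32), ["q1", "q2", "q3"]), f)
reps, lookup = inspect_embeddings("query_emb.demo.pkl")
```

Checking that the number of encoded queries matches the expected split size (e.g. the NQ test set) is a quick way to confirm which split was actually encoded.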
I have the same question about how the Tevatron/msmarco-passage dataset was created. @MXueguang
Hi @Y1Jia, the Tevatron/msmarco-passage data is created from https://github.com/texttron/tevatron/tree/main/examples/coCondenser-marco#get-data
Thank you for getting back to me so fast!
Hi :)
Regarding the MS MARCO passage dataset that gets downloaded from Hugging Face (https://huggingface.co/datasets/Tevatron/msmarco-passage/tree/main):
How was this dataset created? It doesn't match any of the datasets on the original Microsoft site (https://microsoft.github.io/msmarco/Datasets.html).
Thanks in advance