huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RAG - reproducing RAG-Sequence QA score #7465

Closed acslk closed 3 years ago

acslk commented 3 years ago

I'm trying to reproduce RAG-Sequence NQ score of 44.5 presented in Table 1 of the paper at https://arxiv.org/abs/2005.11401.

I used the command in the examples/rag readme

python examples/rag/eval_rag.py \
    --model_name_or_path facebook/rag-sequence-nq \
    --model_type rag_sequence \
    --evaluation_set path/to/test.source \
    --gold_data_path path/to/gold_data \
    --predictions_path path/to/e2e_preds.txt \
    --eval_mode e2e \
    --gold_data_mode qa \
    --n_docs 5 \
    --print_predictions \
    --recalculate

For gold_data_path I used data.retriever.qas.nq-test from DPR repo, consisting of 3610 questions and answers: https://github.com/facebookresearch/DPR/blob/master/data/download_data.py#L91-L97

For evaluation_set, my understanding is that it should contain only the questions, so I extracted just the question column from the qas.nq-test csv file.
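The extraction step can be sketched as follows. This is a minimal illustration (the helper name is mine, not from the repo), assuming the DPR qas file is tab-separated with the question in the first column and the answer list in the second:

```python
# Hypothetical helper: write one question per line for use as
# the --evaluation_set file, assuming a tab-separated qas file
# of the form: <question>\t<answer list>.
import csv

def extract_questions(qas_path, out_path):
    with open(qas_path, newline="") as fin, open(out_path, "w") as fout:
        for row in csv.reader(fin, delimiter="\t"):
            fout.write(row[0] + "\n")  # first column is the question
```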

I tried the above command with n_docs 5 and 10, with the following results:

n_docs 5 INFO:main:F1: 49.67 INFO:main:EM: 42.58

n_docs 10 INFO:main:F1: 50.62 INFO:main:EM: 43.49

With n_docs 10 the EM score is still about 1 point below the number in the paper. What would be the proper setup to reproduce it: a different pretrained checkpoint, a higher n_docs, or different test data?

Thanks in advance!

patrickvonplaten commented 3 years ago

Gently pinging @ola13 here, she probably knows best which command to run to reproduce the eval results :-)

ola13 commented 3 years ago

Hi @acslk, thanks for your post!

You should be able to reproduce paper results for the RAG Token model (44.1 EM on NQ) by evaluating facebook/rag-token-nq with 20 docs.

As for the RAG Sequence model - we have lost some quality when translating the checkpoint from fairseq (the experimentation framework we used to obtain the original paper results) to HuggingFace. We are now working on replicating the paper numbers in HF and we'll update the official facebook/rag-sequence-nq model weights once we have that so stay tuned!

acslk commented 3 years ago

Thanks for the response. I tried the command above with the RAG Token model and n_docs 20 on the NQ test set, and can confirm it matches the paper results: INFO:main:F1: 51.44 INFO:main:EM: 44.10

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.