huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RAG performance on Open-NQ dataset much lower than expected #8285

Closed gaobo1987 closed 3 years ago

gaobo1987 commented 3 years ago

❓ Questions & Help

Details

One peculiar finding: when we ran the rag-sequence-nq model with the provided wiki_dpr index (all models and index files used as-is) on the open-NQ test split (3610 questions, https://github.com/google-research-datasets/natural-questions/tree/master/nq_open), we observed EM=27.2, which is rather different from the EM=44.5 reported in the paper.

We are baffled. Has anyone seen lower performance using the transformers RAG models?


LysandreJik commented 3 years ago

Maybe @lhoestq @patrickvonplaten have an idea

patrickvonplaten commented 3 years ago

Hey @gaobo1987,

We checked that the models match the performance as reported in the paper.

Did you run the model as stated in https://github.com/huggingface/transformers/blob/master/examples/rag/README.md ?

lhoestq commented 3 years ago

Which index did you use exactly with wiki_dpr ? This EM value is expected if you used the compressed one. For the exact one you might need to increase the efSearch parameter of the index. I ran some indexing experiments recently and I'll update the default parameters of the wiki_dpr index with the optimized ones that reproduce RAG's paper results.

EDIT: they've been updated a few weeks ago

gaobo1987 commented 3 years ago

> Hey @gaobo1987,
>
> We checked that the models match the performance as reported in the paper.
>
> Did you run the model as stated in https://github.com/huggingface/transformers/blob/master/examples/rag/README.md ?

Thanks for your reply @patrickvonplaten ,

we did not use the example run script there, but followed the code snippets provided in the Hugging Face documentation:

from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration
import torch

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained("facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True)
# initialize with RagRetriever to do everything in one forward call
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)
input_dict = tokenizer.prepare_seq2seq_batch("How many people live in Paris?", "In Paris, there are 10 million people.", return_tensors="pt")
input_ids = input_dict["input_ids"]
outputs = model(input_ids=input_ids, labels=input_dict["labels"])

# or use the retriever separately
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq")
# 1. Encode
question_hidden_states = model.question_encoder(input_ids)[0]
# 2. Retrieve
docs_dict = retriever(input_ids.numpy(), question_hidden_states.detach().numpy(), return_tensors="pt")
doc_scores = torch.bmm(question_hidden_states.unsqueeze(1), docs_dict["retrieved_doc_embeds"].float().transpose(1, 2)).squeeze(1)
# 3. Forward to generator
outputs = model(context_input_ids=docs_dict["context_input_ids"], context_attention_mask=docs_dict["context_attention_mask"], doc_scores=doc_scores, decoder_input_ids=input_dict["labels"])

see here: https://huggingface.co/transformers/model_doc/rag.html#ragsequenceforgeneration

We did use our own evaluation script for computing EM scores.

In general, we tried to follow the officially prescribed steps as exactly as possible. Differences may arise from our customized EM calculation, but I believe the main source of the performance gap lies elsewhere.
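For reference, open-domain NQ evaluations typically apply SQuAD-style answer normalization before comparing strings, and a stricter or looser normalization alone can shift EM by a few points. A minimal sketch of such a metric (the function names are illustrative, not taken from the RAG eval script):

```python
import re
import string

def normalize_answer(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # strip English articles
    return " ".join(text.split())                 # collapse whitespace

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """EM counts a prediction as correct if it matches ANY gold answer."""
    return any(normalize_answer(prediction) == normalize_answer(g) for g in gold_answers)
```

For example, `exact_match("The Eiffel Tower!", ["eiffel tower"])` is True after normalization; a script that compares raw strings instead would count it as wrong and report a lower EM.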

gaobo1987 commented 3 years ago

> Which index did you use exactly with wiki_dpr ? This EM value is expected if you used the compressed one. For the exact one you might need to increase the efSearch parameter of the index. I ran some indexing experiments recently and I'll update the default parameters of the wiki_dpr index with the optimized ones that reproduce RAG's paper results.

Thanks for the reply @lhoestq. We used the "exact" mode of the wiki_dpr index; we haven't tried the "compressed" mode, nor did we tune the "exact" index. We will check the "compressed" alternative and try tuning the "exact" index. Also great to know that you will update the default parameters!

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

gaobo1987 commented 3 years ago

Hi, an update on this issue: recently I refactored my own RAG code based on transformers 4.1.1 and obtained EM=40.7 on the open NQ dataset with the rag-sequence-nq model (n_beams=4) and a FAISS HNSW index with n_docs=5, efSearch=256, and efConstruction=200. Unfortunately it still didn't reach the expected 44.5 score. Are these sound parameters? Am I missing any? What is the best parameter combination used at Hugging Face? Any advice is much appreciated, thanks! (Note that I couldn't use the original RAG code: firewall restrictions on my server prevented downloading the wiki_dpr.py script as well as the arrow files for exact indexing, so I had to download these files on a much less powerful laptop and upload them to my server. Consequently, I am using a modified version of RagSequenceForGeneration along with a modified RagRetriever.) @lhoestq

krishanudb commented 3 years ago

@gaobo1987 Can you please share how exactly you played around with the efSearch and efConstruction parameters?

That is, where in the code did you make the changes?

gaobo1987 commented 3 years ago

Hello @krishanudb, thanks for your reply. What I did was simply to manually download the wiki_dpr-train.arrow file, use it to construct a FAISS HNSW index with efSearch=256 and efConstruction=200, and save this index to disk. I wrote wrappers around RagRetriever and RagSequenceForGeneration so that RAG can run directly on the aforementioned FAISS index, instead of relying on the Hugging Face Datasets utilities and other caching sub-routines. I did not change the models in any way. Could you provide an answer to my question regarding the best combination of parameters from Hugging Face to reach the performance reported in the original paper? Thanks for your time.

krishanudb commented 3 years ago

@gaobo1987 There are several versions of the DPR model (single-nq vs. multiset) as well as of the precomputed wiki_dpr passages. I am not sure which one the authors used to get 44% EM, but I think they used the single-nq models for these tasks.

Make sure that you are using the "right" model. Maybe the authors can shed more light on this.

I am facing the same issue as well: I don't get more than 40% EM no matter whether I use the multiset or the nq-single models.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.