huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Low retrieval and generation performance when evaluating a RAG model consolidated with consolidate_rag_checkpoint using BART-LARGE as generator #31349

Closed Rakin061 closed 2 months ago

Rakin061 commented 3 months ago

Who can help?

@ArthurZucker @younesbelkada @LysandreJik

Reproduction

!python consolidate_rag_checkpoint.py \
    --model_type rag_sequence \
    --generator_name_or_path facebook/bart-large \
    --question_encoder_name_or_path facebook/dpr-question_encoder-single-nq-base \
    --dest checkpoint_dpr_bart_dpr-qe
!python finetune_rag.py \
    --data_dir QA_dataset  \
    --output_dir checkpoint_dpr_final_bart_large_qe \
    --model_name_or_path checkpoint_dpr_bart_dpr-qe \
    --model_type rag_sequence \
    --fp16 \
    --gpus 1 \
    --profile \
    --do_train \
    --do_predict \
    --train_batch_size 2 \
    --eval_batch_size 1 \
    --num_train_epochs 1 \
    --index_name custom \
    --passages_path output_dataset_index_dpr/test_dataset \
    --index_path output_dataset_index_dpr/test_dataset_hnsw_index.faiss
!python eval_rag.py \
    --model_name_or_path checkpoint_dpr_final/checkpoint2 \
    --model_type rag_token \
    --evaluation_set evaluation_results/biencoder-nq-dev.questions \
    --gold_data_path evaluation_results/biencoder-nq-dev.answers \
    --predictions_path evaluation_results/e2e_preds.txt \
    --eval_mode e2e \
    --gold_data_mode ans \
    --n_docs 5 \
    --max_length 20 \
    --print_predictions \
    --recalculate
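
One thing worth ruling out before comparing scores: as written, the commands above pass rag_sequence to consolidation and fine-tuning but rag_token to evaluation, and the evaluated path (checkpoint_dpr_final/checkpoint2) is not under the fine-tuning --output_dir (checkpoint_dpr_final_bart_large_qe). Whether that contributes to the low scores is not established in this thread. The following is only a minimal sketch of a consistency check; the helper names parse_flags and find_mismatches are hypothetical and are not part of the RAG example scripts, and the two command strings are shortened copies of the commands above.

```python
import shlex
from pathlib import PurePosixPath


def parse_flags(command: str) -> dict:
    """Map each --flag to its value, or to True for bare switches such as --fp16."""
    tokens = shlex.split(command)
    flags = {}
    i = 0
    while i < len(tokens):
        if tokens[i].startswith("--"):
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
            flags[tokens[i][2:]] = tokens[i + 1] if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1
    return flags


def find_mismatches(finetune_cmd: str, eval_cmd: str) -> list:
    """Return human-readable warnings about flags that disagree between the two commands."""
    train, evaluate = parse_flags(finetune_cmd), parse_flags(eval_cmd)
    problems = []
    if train.get("model_type") != evaluate.get("model_type"):
        problems.append(
            f"model_type differs: {train.get('model_type')} vs {evaluate.get('model_type')}"
        )
    output_dir = PurePosixPath(str(train.get("output_dir", "")))
    eval_path = PurePosixPath(str(evaluate.get("model_name_or_path", "")))
    # The evaluated checkpoint should be the output directory or live below it.
    if output_dir != eval_path and output_dir not in eval_path.parents:
        problems.append(f"{eval_path} is not inside output_dir {output_dir}")
    return problems


# Shortened copies of the fine-tuning and evaluation commands from this issue.
FINETUNE = (
    "python finetune_rag.py --output_dir checkpoint_dpr_final_bart_large_qe "
    "--model_name_or_path checkpoint_dpr_bart_dpr-qe --model_type rag_sequence --fp16"
)
EVAL = (
    "python eval_rag.py --model_name_or_path checkpoint_dpr_final/checkpoint2 "
    "--model_type rag_token --eval_mode e2e"
)

problems = find_mismatches(FINETUNE, EVAL)
for problem in problems:
    print(problem)
```

If both warnings are resolved and the scores stay low, the mismatch can be excluded as a cause.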

Expected behavior

I get F1 and EM scores above 80 when I evaluate rag-sequence-nq directly.

However, when I replace rag-sequence-nq with its two components loaded separately, bart-large and dpr-question_encoder-single-nq-base, and consolidate them into a checkpoint for evaluation, I get scores below 5 on both metrics with the same test set of questions.

This is surprising, because rag-sequence-nq also consists of bart-large and dpr-question_encoder-single-nq-base, yet performance drops when the two are loaded separately through the consolidate_rag_checkpoint.py script.

The consolidate_rag_checkpoint.py script needs attention, particularly where it saves the models as safetensors.

Unexpected behaviors:

  1. Retrieval performance drops: most of the time the model picks the wrong documents.
  2. Generation performance is even lower. Instead of generating a few words, the model generates a whole sentence that is merely lexically similar.

Example:

Q: In which year Nasif got Turing Prize?
A: Sakib Reza / Sakib, a 25-year-old BCS Cadre, was born and raised in Bangladesh, where he developed a deep love for cricket, a passion that remains integral to his life. Sakib
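
To illustrate why sentence-length outputs like the one above score so poorly, here is a minimal sketch of exact match and token-level F1. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles), which may differ in detail from what eval_rag.py does. The gold answer and predictions are hypothetical strings, not data from this issue.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: a short gold answer against a short and a long prediction.
gold = "2018"
short_prediction = "The 2018."
long_prediction = "He received the prize in 2018 after many years of research"

print(exact_match(short_prediction, gold), f1(short_prediction, gold))
print(exact_match(long_prediction, gold), f1(long_prediction, gold))
```

Even when the long prediction contains the right token, its precision is diluted by the extra words, so F1 stays low and exact match is zero. A prediction that contains no gold token at all scores zero on both.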

ArthurZucker commented 3 months ago

Hey! Not sure we will have time to dive into this. Maybe adding a note to the docs so that people use rag-sequence-nq as a whole rather than split would be nice?

Rakin061 commented 3 months ago

Hello,
Actually, the purpose of the consolidate_rag_checkpoint.py script is to initialize fine-tuning from a base model built with different question encoder and generator architectures. So an alternative solution is still needed for anyone who wants to train RAG models with a generator other than BART, or with a different downstream question encoder, rather than being forced to use rag-sequence-nq as a single composite model.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.