[RAG] Expected RAG output after fine tuning #10557

Closed nakasato closed 3 years ago

nakasato commented 3 years ago

Hi there.

Perhaps the following isn’t even a real issue, but I’m a bit confused by the outputs I got.

I’m trying to fine-tune RAG on a set of question-answer pairs I have (for now, not that many: fewer than 1k). I have split them as suggested (train.source, val.source…). After running the fine-tuning command shown below, only two files (~2 kB) were generated.

Is that right? I was expecting a large binary file, or something similar, containing the weight matrices, so that I could reuse them in a later run.

Could you please tell me what I’m missing here?

I provide more details below. By the way, I have two NVIDIA RTX 3090 GPUs (24 GB each), but they were barely used during the whole process (which took ~3 hours).


python \
    --data_dir rag_manual_qa_finetuning \
    --output_dir output_ft \
    --model_name_or_path rag-sequence-base \
    --model_type rag_sequence \
    --gpus 2 \
    --distributed_retriever pytorch

Logs (strangely, they even seem to be generated in duplicate; I don’t know why):

loading configuration file rag-sequence-base/config.json
Model config RagConfig {
  "architectures": [
  "dataset": "wiki_dpr",
  "dataset_split": "train",
  "do_deduplication": true,
  "do_marginalize": false,
  "doc_sep": " // ",
  "exclude_bos_score": false,
  "forced_eos_token_id": 2,
  "generator": {
    "_name_or_path": "",
    "_num_labels": 3,
    "activation_dropout": 0.0,
    "activation_function": "gelu",
    "add_bias_logits": false,
    "add_cross_attention": false,
    "add_final_layer_norm": false,
    "architectures": [
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "classif_dropout": 0.0,
    "classifier_dropout": 0.0,
    "d_model": 1024,
    "decoder_attention_heads": 16,
    "decoder_ffn_dim": 4096,
    "decoder_layerdrop": 0.0,
    "decoder_layers": 12,
    "decoder_start_token_id": 2,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.1,
    "early_stopping": false,
    "encoder_attention_heads": 16,
    "encoder_ffn_dim": 4096,
    "encoder_layerdrop": 0.0,
    "encoder_layers": 12,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "extra_pos_embeddings": 2,
    "finetuning_task": null,
    "force_bos_token_to_be_generated": false,
    "forced_bos_token_id": null,
    "forced_eos_token_id": 2,
    "gradient_checkpointing": false,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1",
      "2": "LABEL_2"
    "init_std": 0.02,
    "is_decoder": false,
    "is_encoder_decoder": true,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1,
      "LABEL_2": 2
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 1024,
    "min_length": 0,
    "model_type": "bart",
    "no_repeat_ngram_size": 0,
    "normalize_before": false,
    "normalize_embedding": true,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 12,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_past": false,
    "output_scores": false,
    "pad_token_id": 1,
    "prefix": " ",
    "pruned_heads": {},
    "repetition_penalty": 1.0,
    "return_dict": false,
    "return_dict_in_generate": false,
    "scale_embedding": false,
    "sep_token_id": null,
    "static_position_embeddings": false,
    "task_specific_params": {
      "summarization": {
        "early_stopping": true,
        "length_penalty": 2.0,
        "max_length": 142,
        "min_length": 56,
        "no_repeat_ngram_size": 3,
        "num_beams": 4
    "temperature": 1.0,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torchscript": false,
    "transformers_version": "4.4.0.dev0",
    "use_bfloat16": false,
    "use_cache": true,
    "vocab_size": 50265
  "index_name": "exact",
  "index_path": null,
  "is_encoder_decoder": true,
  "label_smoothing": 0.0,
  "max_combined_length": 300,
  "model_type": "rag",
  "n_docs": 5,
  "output_retrieved": false,
  "passages_path": null,
  "question_encoder": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "gradient_checkpointing": false,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    "layer_norm_eps": 1e-12,
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 512,
    "min_length": 0,
    "model_type": "dpr",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 12,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 12,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 0,
    "position_embedding_type": "absolute",
    "prefix": null,
    "projection_dim": 0,
    "pruned_heads": {},
    "repetition_penalty": 1.0,
    "return_dict": false,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torchscript": false,
    "transformers_version": "4.4.0.dev0",
    "type_vocab_size": 2,
    "use_bfloat16": false,
    "use_cache": true,
    "vocab_size": 30522
  "reduce_loss": false,
  "retrieval_batch_size": 8,
  "retrieval_vector_size": 768,
  "title_sep": " / ",
  "use_cache": true,
  "use_dummy_dataset": false,
  "vocab_size": null

Model name 'rag-sequence-base' not found in model shortcut name list (facebook/dpr-question_encoder-single-nq-base, facebook/dpr-question_encoder-multiset-base). Assuming 'rag-sequence-base' is a path, a model identifier, or url to a directory containing tokenizer files.
Didn't find file rag-sequence-base/question_encoder_tokenizer/tokenizer.json. We won't load it.
Didn't find file rag-sequence-base/question_encoder_tokenizer/added_tokens.json. We won't load it.
loading file rag-sequence-base/question_encoder_tokenizer/vocab.txt
loading file None
loading file None
loading file rag-sequence-base/question_encoder_tokenizer/special_tokens_map.json
loading file rag-sequence-base/question_encoder_tokenizer/tokenizer_config.json
Model name 'rag-sequence-base' not found in model shortcut name list (facebook/bart-base, facebook/bart-large, facebook/bart-large-mnli, facebook/bart-large-cnn, facebook/bart-large-xsum, yjernite/bart_eli5). Assuming 'rag-sequence-base' is a path, a model identifier, or url to a directory containing tokenizer files.
Didn't find file rag-sequence-base/generator_tokenizer/tokenizer.json. We won't load it.
Didn't find file rag-sequence-base/generator_tokenizer/added_tokens.json. We won't load it.
loading file rag-sequence-base/generator_tokenizer/vocab.json
loading file rag-sequence-base/generator_tokenizer/merges.txt
loading file None
loading file None
loading file rag-sequence-base/generator_tokenizer/special_tokens_map.json
loading file rag-sequence-base/generator_tokenizer/tokenizer_config.json
Loading passages from wiki_dpr
Downloading: 9.64kB [00:00, 10.8MB/s]                                           
Downloading: 67.5kB [00:00, 59.5MB/s]                                           
WARNING:datasets.builder:Using custom data configuration psgs_w100.nq.no_index-dummy=False,with_index=False
Downloading and preparing dataset wiki_dpr/psgs_w100.nq.no_index (download: 66.09 GiB, generated: 73.03 GiB, post-processed: Unknown size, total: 139.13 GiB) to /home/usp/.cache/huggingface/datasets/wiki_dpr/psgs_w100.nq.no_index-dummy=False,with_index=False/0.0.0/91b145e64f5bc8b55a7b3e9f730786ad6eb19cd5bc020e2e02cdf7d0cb9db9c1...
Dataset wiki_dpr downloaded and prepared to /home/usp/.cache/huggingface/datasets/wiki_dpr/psgs_w100.nq.no_index-dummy=False,with_index=False/0.0.0/91b145e64f5bc8b55a7b3e9f730786ad6eb19cd5bc020e2e02cdf7d0cb9db9c1. Subsequent calls will reuse this data.
loading weights file rag-sequence-base/pytorch_model.bin
All model checkpoint weights were used when initializing RagSequenceForGeneration.

Some weights of RagSequenceForGeneration were not initialized from the model checkpoint at rag-sequence-base and are newly initialized: ['rag.generator.lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model name 'rag-sequence-base' not found in model shortcut name list (facebook/dpr-question_encoder-single-nq-base, facebook/dpr-question_encoder-multiset-base). Assuming 'rag-sequence-base' is a path, a model identifier, or url to a directory containing tokenizer files.
Didn't find file rag-sequence-base/question_encoder_tokenizer/tokenizer.json. We won't load it.
Didn't find file rag-sequence-base/question_encoder_tokenizer/added_tokens.json. We won't load it.
loading file rag-sequence-base/question_encoder_tokenizer/vocab.txt
loading file None
loading file None
loading file rag-sequence-base/question_encoder_tokenizer/special_tokens_map.json
loading file rag-sequence-base/question_encoder_tokenizer/tokenizer_config.json

Model name 'rag-sequence-base' not found in model shortcut name list (facebook/bart-base, facebook/bart-large, facebook/bart-large-mnli, facebook/bart-large-cnn, facebook/bart-large-xsum, yjernite/bart_eli5). Assuming 'rag-sequence-base' is a path, a model identifier, or url to a directory containing tokenizer files.
Didn't find file rag-sequence-base/generator_tokenizer/tokenizer.json. We won't load it.
Didn't find file rag-sequence-base/generator_tokenizer/added_tokens.json. We won't load it.
loading file rag-sequence-base/generator_tokenizer/vocab.json
loading file rag-sequence-base/generator_tokenizer/merges.txt
loading file None
loading file None
loading file rag-sequence-base/generator_tokenizer/special_tokens_map.json
loading file rag-sequence-base/generator_tokenizer/tokenizer_config.json
GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
TPU available: False, using: 0 TPU cores
INFO:lightning:TPU available: False, using: 0 TPU cores

LysandreJik commented 3 years ago

Pinging @lhoestq and @patrickvonplaten

MMenonJ commented 3 years ago

Hello there,

I am having the exact same issue when trying to fine-tune RAG. I used the master version of transformers.

I tried a couple of different things like:

They all produced the same two files: git_log.json and hparams.pkl
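
In case it helps anyone checking their own run, here is a minimal sketch that lists what ended up in the output directory and separates weight-like files from everything else. The helper name `summarize_output_dir` and the list of suffixes are assumptions made for illustration; they are not part of the RAG example code, and the actual checkpoint file names may differ.

```python
from pathlib import Path

# Assumed suffixes for weight/checkpoint files; adjust to your setup.
WEIGHT_SUFFIXES = {".bin", ".ckpt", ".pt", ".safetensors"}


def summarize_output_dir(output_dir):
    """Return (weight_files, other_files) as sorted lists of relative paths."""
    root = Path(output_dir)
    weights, others = [], []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        # Classify by file suffix only; the contents are never inspected.
        if path.suffix in WEIGHT_SUFFIXES:
            weights.append(rel)
        else:
            others.append(rel)
    return weights, others
```

If the first list comes back empty, as it would for a directory holding only git_log.json and hparams.pkl, no weights were written and training most likely never ran.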

Also, I realized that if the folder with the training data is empty, the results are the same.

I am not sure whether I am doing something wrong with the implementation or I am just not using the hparams correctly.

Thanks in advance

Marcos Menon

lhoestq commented 3 years ago

Hi! If I recall correctly, the model is saved using PyTorch Lightning's on_save_checkpoint hook, so the issue might come from the checkpointing config at

nakasato commented 3 years ago

Hi, @lhoestq. Thanks for your quick response.

From the log output, I believe the system is not even starting the network training. Hence, I guess the problem occurs a step before saving, especially since I did not change any of the code provided by the main transformers library.

Another reason to think so: the output logs don't change even when I run the !python ... command with my data_dir completely empty. So I think either the system is not training at all, or there is a mistake in my input that makes the code skip training.

Anyway, below is a sample of the training data I'm using. Each file has one question per line in the source and the respective expected answer in the target (fine-tuning for a QA task).


How big is the Brazilian coastline?
Why was the Port of Santos known as the port of death in the past?
Which Brazilian state has the largest number of coastal cities?

7,491 km.
The Yellow Fever.
Bahia state.
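
A quick way to sanity-check this layout, as a sketch only: read the source and target files and pair them line by line. `train.source` is the name used earlier in this thread; the `.target` suffix and the `load_pairs` helper are assumptions for illustration, not part of the RAG example code.

```python
from pathlib import Path


def load_pairs(data_dir, split="train"):
    """Pair each question in <split>.source with the answer on the same line of <split>.target."""
    data_dir = Path(data_dir)
    # read_text raises FileNotFoundError if the file is missing (e.g. an empty data_dir).
    questions = (data_dir / f"{split}.source").read_text(encoding="utf-8").splitlines()
    answers = (data_dir / f"{split}.target").read_text(encoding="utf-8").splitlines()
    if len(questions) != len(answers):
        raise ValueError(
            f"{split}: {len(questions)} questions but {len(answers)} answers"
        )
    return list(zip(questions, answers))
```

Unlike the run described above, this check fails loudly on an empty data_dir instead of finishing silently.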
lhoestq commented 3 years ago

Oh, OK. Maybe this is because you need the do_train flag? See here:

nakasato commented 3 years ago

@lhoestq, that's it; it has solved the problem - actually, quite a simple thing.
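
For anyone who lands here later, the behaviour can be sketched roughly as follows. This is a simplified illustration of an opt-in training flag and not the actual fine-tuning script: the parser, the `run` function and the returned strings are hypothetical, and only the `do_train` flag name comes from this thread.

```python
import argparse


def build_parser():
    # Hypothetical parser for illustration; the real script defines many more options.
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", required=True)
    # store_true means the flag defaults to False unless passed explicitly.
    parser.add_argument("--do_train", action="store_true", help="Run training.")
    return parser


def run(argv):
    args = build_parser().parse_args(argv)
    if not args.do_train:
        # Setup (config, tokenizers, hparams dump) can still happen,
        # but no training step runs and no weights are saved.
        return "skipped training"
    return "trained"
```

With a pattern like this, `run(["--output_dir", "output_ft"])` exits without training, which matches a run that only leaves small bookkeeping files behind; adding `--do_train` to the argument list is what enables the training branch.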

Since the central idea of fine-tuning is to train the model, I guess it'd be nice to have these params shown in the README too: although they are needed right away, there's no mention of them there.

Anyway, thank you again, @lhoestq.

lhoestq commented 3 years ago

You're totally right, they must be in the README. Feel free to open a PR to add them if you want to contribute :)

nakasato commented 3 years ago

So, that's right. Meanwhile, I'm going to close this issue :)

shamanez commented 3 years ago

@nakasato @MMenonJ I am also fine-tuning RAG on my custom dataset. I am using the rag-token model. Although I start from an already trained RAG, the loss starts around 70. Can you let me know how your loss changes? At what value does it start?

nakasato commented 3 years ago

Hi, @shamanez. Sure: in my last training round, with a dataset of ~30MB (for DPR) and 2400 question-answer pairs in the training data for fine-tuning, the loss started off at 118.2 and ended at 30.2 after 100 epochs. I'm using a rag-sequence-base model. In the different settings I've tried so far, however, it's common to see the same pattern: it starts around 130 and ends around 30.

Nevertheless, maybe because of the extreme specificity of my data (abstracts), or because of the quality of the question-answer pairs I have (which were generated automatically with a T5 model), the final results were mostly nonsense in this case.

Btw, since you're also working with RAG, perhaps we can exchange our working experience. Feel free to send me an email ;)

shamanez commented 3 years ago

Thanks a lot. I made some modifications to RAG, like end-to-end training of the retriever. The code is almost finished now. I will share it very soon, with documentation.

nakasato commented 3 years ago

Cool. Good job! ;)

surbhi498 commented 4 months ago

@shamanez Hi, can you share your code? I am struggling with training on my custom dataset after initializing retrieval. I can share my code if someone could help.