Pinging @lhoestq and @patrickvonplaten
Hello there,
I am having the exact same issue when trying to fine-tune RAG. I used the master version of transformers.
I tried a couple of different things, like:
They all returned the same output files: git_log.json and hparams.pkl
Also, I realized that the results are the same even if the folder with the training data is empty.
I am not sure whether I am doing something wrong in the implementation or just not using the hparams correctly.
Thanks in advance
Marcos Menon
Hi! If I recall correctly, the model is saved using PyTorch Lightning's on_save_checkpoint. So the issue might come from the checkpointing config at
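For context, here's a minimal sketch of how a PyTorch Lightning ModelCheckpoint callback is usually configured; the names and arguments below are illustrative and may differ from what finetune_rag.py actually uses:

```python
# Minimal sketch of PyTorch Lightning checkpointing (illustrative only;
# finetune_rag.py's actual config may differ, and older PL versions used
# `filepath` instead of `dirpath`/`filename`).
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="output_dir",             # where the .ckpt weight files are written
    filename="{epoch}-{val_loss:.2f}",
    monitor="val_loss",               # metric that decides which checkpoint is "best"
    save_top_k=1,                     # keep only the best checkpoint
)

trainer = pl.Trainer(callbacks=[checkpoint_callback], max_epochs=100)
# trainer.fit(model)  # .ckpt files only appear once training actually runs
```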
Hi, @lhoestq. Thanks for your quick response.
From the log output, I believe the system is not even starting the network training. Hence, I guess this issue arises a step before the saving step - also because I did not change any of the code provided by the main transformers library.
Another reason to think so: the output logs don't change even when I run !python finetune_rag.py ... keeping my data_dir totally empty. So, I think the system is not training at all, or maybe there is a mistake in my input that makes the code skip the training.
Anyway, below there's a sample of the training data I'm using. Each file has one question per line in the source and the respective expected answer in the target (fine-tuning for a QA task).
train.source
How big is the Brazilian coastline?
Why was the Port of Santos known as the port of death in the past?
Which Brazilian state has the largest number of coastal cities?
train.target
7,491 km.
The Yellow Fever.
Bahia state.
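For anyone reproducing this format: source and target are paired line by line. A small sanity check could look like the sketch below (the load_pairs helper is just illustrative, not part of the repo):

```python
# Illustrative sanity check for the line-aligned seq2seq format described above:
# line N of train.source is the question, line N of train.target is its answer.
def load_pairs(source_path, target_path):
    with open(source_path, encoding="utf-8") as src, open(target_path, encoding="utf-8") as tgt:
        questions = [line.strip() for line in src]
        answers = [line.strip() for line in tgt]
    assert len(questions) == len(answers), "source/target line counts must match"
    return list(zip(questions, answers))

pairs = load_pairs("train.source", "train.target")
print(pairs[0])  # ('How big is the Brazilian coastline?', '7,491 km.')
```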
Oh ok. Maybe this is because you need the do_train flag? See here:
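For reference, an invocation would look something like this (paths and the flags other than --do_train are placeholders from memory of the examples README; double-check against the script's --help):

```
python finetune_rag.py \
    --data_dir path/to/data \
    --output_dir path/to/output \
    --model_name_or_path facebook/rag-sequence-base \
    --model_type rag_sequence \
    --do_train \
    --do_predict
```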
@lhoestq, that's it; it has solved the problem - actually, quite a simple thing.
Since the central idea of fine-tuning is precisely to train the model, I guess it'd be nice to have these params shown in the README too - despite being immediately needed, there's no mention of them there.
Anyway, thank you again, @lhoestq.
You're totally right, they must be in the README. Feel free to open a PR to add them, if you want to contribute :)
So, that's right. Meanwhile, I'm going to close this issue :)
@nakasato @MMenonJ I am also fine-tuning RAG on my custom dataset, using the rag-token model. Although I use an already-trained RAG, the loss starts around 70. Can you let me know how your loss changes? At what value does it start?
Hi, @shamanez. Sure: in my last training round, with a dataset of ~30MB (for DPR) and 2400 question-answer pairs in the training data for fine-tune, the loss started off at 118.2, and ended at 30.2, after 100 epochs. I'm using a rag-sequence-base model. In different settings I've tried so far, however, it's common to see the same pattern: it starts around ~130 and ends around ~30.
Nevertheless, maybe because of the extreme specificity of my data (abstracts), or because of the quality of the question-answer pairs I have (which were generated automatically with a T5 model), the final results were largely nonsense in this case.
Btw, since you're also working with RAG, perhaps we can exchange our working experience. Feel free to send me an email ;)
Thanks a lot. I made some modifications to RAG, like end-to-end training of the retriever. The code is almost finished now; I will share it very soon with documentation.
Cool. Good job! ;)
Hi @shamanez, can you share your code? I am struggling with training on my custom dataset after initializing the retriever. Can I share my code, in case someone could help?
Hi there.
Perhaps the following isn't even a real issue, but I'm a bit confused by the outputs I got.
I'm trying to fine-tune RAG on a bunch of question-answer pairs I have (for now, not that many: < 1k). I have split them as suggested (train.source, train.target, val.source…). After running finetune_rag.py, the outputs generated were only two files (~2 kB). Is that right? Because I was expecting a big binary file, or something like that, containing the weight matrices, so I could use them afterwards in a new trial.
Could you please tell me what’s the point I’m missing here?
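To be clear, what I had in mind for "using them afterwards" was something like the sketch below (model_dir is a placeholder, and this assumes the fine-tuned weights were exported in the Hugging Face format, e.g. via save_pretrained):

```python
# Illustrative: reloading a fine-tuned RAG model for generation.
# model_dir is a placeholder; assumes weights saved in HF format.
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

model_dir = "path/to/output/checkpoint"  # placeholder path
tokenizer = RagTokenizer.from_pretrained(model_dir)
retriever = RagRetriever.from_pretrained(model_dir, index_name="exact", use_dummy_dataset=True)
model = RagSequenceForGeneration.from_pretrained(model_dir, retriever=retriever)

inputs = tokenizer("How big is the Brazilian coastline?", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```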
I provide more details below. Btw, I have two NVIDIA RTX 3090s (24 GB each), but they were barely used in the whole process (which took ~3 hours).
Command:
Logs (strangely, the logs even seem to be generated in duplicate - I don't know why):