[Open] KosumosuL opened this issue 1 year ago
I am curious about this as well.
Hi @KosumosuL @MrigankRaman
We are also seeing a larger loss for Vicuna v1.1 than for the Vicuna v0 models. Although the qualitative results look good, we are still investigating the reasons. One possible cause is that in v0 prompts the separator token "###" follows every utterance, while in v1 the end-of-turn token is added only after the GPT response. Furthermore, "###" can sometimes be tokenized as "#" and "##" (two tokens instead of one).
Given that "###" is very easy for the model to learn to predict, this may be why the loss of the v0 model is lower than that of v1.
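One quick way to check the tokenization behavior yourself is the sketch below (the checkpoint path is a placeholder for wherever your local Vicuna weights live):

```python
from transformers import AutoTokenizer

# Placeholder path: point this at your local Vicuna/LLaMA checkpoint.
tokenizer = AutoTokenizer.from_pretrained("path/to/llama-vicuna-7b-v1.1", use_fast=False)

# Inspect how "###" is split in a few contexts; depending on the surrounding
# text, it may come out as one token or as "#" + "##".
for text in ["###", " ###", "Assistant: Hello!###"]:
    ids = tokenizer(text, add_special_tokens=False).input_ids
    print(repr(text), "->", tokenizer.convert_ids_to_tokens(ids))
```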
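As a rough illustration of why that matters (the numbers below are hypothetical, not measured): if the separators are supervised and near-trivially predictable, they pull the reported mean loss down.

```python
# Hypothetical per-token losses, purely illustrative:
answer_loss = 2.5   # loss on real response tokens
sep_loss = 0.1      # loss on easy-to-predict "###" separator tokens
sep_frac = 0.2      # fraction of supervised tokens that are separators (v0-style)

mean_with_seps = (1 - sep_frac) * answer_loss + sep_frac * sep_loss
print(f"mean loss with separators supervised: {mean_with_seps:.2f}")  # 2.02
print(f"mean loss without separators:         {answer_loss:.2f}")     # 2.50
```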
If you have better explanations or insights, please let me know. I'll post an update if there are more findings.
Thanks!
Question
Here is my training script:
I use the llama-vicuna-7b-v1.1 obtained from FastChat and fine-tune it on the CC595K image-text pairs with the v1 conversation format; the loss stays around 2.5 and is hard to decrease. Yet when using llama-vicuna-7b with the v0 format, the loss rapidly decreases to 1.3~1.5.
Are there any bugs in the latest code? Or is this loss just normal for version v1?
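For reference, here is a minimal sketch of how the two conversation formats differ (the template strings are paraphrased approximations, not the exact ones; see `fastchat/conversation.py` for the authoritative definitions):

```python
def prompt_v0(turns, system="A chat between a curious human and an AI assistant."):
    # v0 style: every utterance, human or assistant, ends with the "###" separator.
    s = system + "\n\n"
    for role, text in turns:
        s += f"### {role}: {text}\n"
    return s + "### Assistant:"

def prompt_v1(turns, system="A chat between a curious user and an AI assistant."):
    # v1.1 style: turns are separated by spaces, and only the assistant's
    # completed reply is terminated with the EOS token "</s>".
    s = system
    for role, text in turns:
        end = "</s>" if role == "ASSISTANT" else ""
        s += f" {role}: {text}{end}"
    return s + " ASSISTANT:"

print(prompt_v0([("Human", "What is in the image?"), ("Assistant", "A cat.")]))
print(prompt_v1([("USER", "What is in the image?"), ("ASSISTANT", "A cat.")]))
```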