haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] Not able to save model weights after Lora Finetuning #1734

Open narayanasastry-rvds opened 1 month ago

narayanasastry-rvds commented 1 month ago

Describe the issue

Issue: LoRA finetuning with both Zero2.json and Zero3.json. During finetuning, the train and validation losses decrease, but the saved model checkpoint contains only the initialised weights.

Command:

Model finetuning is done using: sh finetune_task_lora.sh

Log:

With **Zero3.json**, all weights are the same as at initialisation:
Vision Lora A mean values: tensor(5.9843e-05, device='cuda:0', dtype=torch.bfloat16)
Non-vision Lora A mean values: tensor(5.1260e-05, device='cuda:0', dtype=torch.bfloat16)
Vision Lora B sum values: tensor(0., device='cuda:0', dtype=torch.bfloat16)
Non-vision Lora B sum values: tensor(0., device='cuda:0', dtype=torch.bfloat16)

With **Zero2.json**, only the non-vision LoRA-B weights have changed from initialisation:
Vision Lora A mean values: tensor(5.9843e-05, device='cuda:0', dtype=torch.bfloat16)
Non-vision Lora A mean values: tensor(5.1260e-05, device='cuda:0', dtype=torch.bfloat16)
Vision Lora B sum values: tensor(0., device='cuda:0', dtype=torch.bfloat16)
Non-vision Lora B sum values: tensor(**2.1562**, device='cuda:0', dtype=torch.bfloat16)
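The log values above follow directly from how LoRA adapters are initialised: `lora_A` gets small random values and `lora_B` starts at all zeros, so a zero `lora_B` sum after training means the trained weights never reached the file. A minimal sketch of that sanity check (the layer names and shapes here are hypothetical stand-ins; a real check would load the state dict from the checkpoint directory instead):

```python
# Sanity check: freshly initialised LoRA has lora_A ~ small random values
# and lora_B = zeros, so an all-zero lora_B after training indicates the
# checkpoint holds only initialisation, not the trained adapter.
import torch

def looks_untrained(state_dict):
    """Return True if every lora_B tensor in the state dict is all zeros."""
    b_keys = [k for k in state_dict if "lora_B" in k]
    return all(torch.count_nonzero(state_dict[k]) == 0 for k in b_keys)

# Simulate a fresh (untrained) adapter with hypothetical names/shapes.
fresh = {
    "model.layers.0.self_attn.q_proj.lora_A.weight": torch.randn(16, 64) * 1e-4,
    "model.layers.0.self_attn.q_proj.lora_B.weight": torch.zeros(64, 16),
}
print(looks_untrained(fresh))    # True: only initialisation values

# After real training, lora_B should contain non-zero values.
trained = {k: (v if "lora_A" in k else torch.randn_like(v))
           for k, v in fresh.items()}
print(looks_untrained(trained))  # False
```

Running this kind of check against the saved `state_dict` is how the numbers in the log above distinguish a real adapter from an untouched one.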

Screenshots: [plot of train and validation loss during model finetuning]

Silverasdf commented 1 month ago

I had the same issue. Your checkpoint is saved in "./checkpoints" by default. I ended up checking the modification times to find the right one, lol. Then you can merge.

Hope this helps.

narayanasastry-rvds commented 1 month ago

Thanks for the reply. I have been passing the correct checkpoints folder. I haven't been using the merge script from here because it throws an error like this:

OSError: LLava_fine_tune/checkpoints/llava-v1.6-7b-lora/checkpoint-159 does not appear to have a file named config.json. Checkout 'https://huggingface.co/LLava_fine_tune/checkpoints/llava-v1.6-7b-lora/checkpoint-159/tree/main' for available files.

I think it is looking for the checkpoint on the Hugging Face Hub, but I have it locally. Do you know how to fix this?
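That guess matches how `from_pretrained` behaves: when the string it receives does not exist as a local path (for example, a relative path run from the wrong working directory, or a checkpoint directory missing its config files), transformers falls back to treating it as a Hub repo id, which produces exactly the "Checkout 'https://huggingface.co/...'" error above. A small pre-flight check before merging can catch this; the required file names below are an assumption based on typical PEFT/LoRA checkpoints, so adjust them to what your training run actually writes:

```python
# Verify a checkpoint path exists locally and contains the expected files
# before handing it to from_pretrained, which would otherwise fall back to
# interpreting the string as a Hugging Face Hub repo id.
import os
import tempfile

def check_checkpoint_dir(path, required=("adapter_model.bin", "adapter_config.json")):
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f"{path} is not a local directory; from_pretrained would try "
            "the Hugging Face Hub with this string as a repo id"
        )
    missing = [f for f in required if not os.path.isfile(os.path.join(path, f))]
    if missing:
        raise FileNotFoundError(f"{path} exists but is missing: {missing}")
    return True

# Demonstrate with a throwaway directory holding the expected files.
with tempfile.TemporaryDirectory() as d:
    for name in ("adapter_model.bin", "adapter_config.json"):
        open(os.path.join(d, name), "w").close()
    print(check_checkpoint_dir(d))  # True
```

Passing an absolute path that survives this check rules out the working-directory problem before the merge script ever touches the Hub.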

Silverasdf commented 1 month ago

I had this issue too. Look at the finetune_task_lora.sh script: copy and paste the model name you used there for your model, and it should work. That's what I had to do. Also, perhaps you are giving a wrong relative path? Try giving an absolute path for your checkpoints.
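Putting both suggestions together, a merge invocation with the repo's `scripts/merge_lora_weights.py` would look roughly like this. The paths are example placeholders, and the `--model-base` value must be filled in with the same base model used in finetune_task_lora.sh:

```shell
# Merge the LoRA adapter into its base model so the result loads directly.
# Use absolute paths, and pass the same base model as in finetune_task_lora.sh.
python scripts/merge_lora_weights.py \
    --model-path /abs/path/to/checkpoints/llava-v1.6-7b-lora \
    --model-base <base-model-id-from-finetune_task_lora.sh> \
    --save-model-path /abs/path/to/llava-v1.6-7b-merged
```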

narayanasastry-rvds commented 4 weeks ago

Thanks for the reply again. Now I am able to merge the checkpoints. But the next problem is that the merged model gives no output after "Assistant: ", whereas the base model generates a response.