haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] `resume_from_checkpoint` fails when finetuning in the lora settings #1200

Open zengxingchen opened 6 months ago

zengxingchen commented 6 months ago

Describe the issue

I think the code is trying to resume from the checkpoint as if it were a full-parameter fine-tuning checkpoint.
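
For context, the resume logic in llava/train/train.py looks roughly like this (paraphrased sketch, with `trainer` and `training_args` coming from the surrounding script): if any checkpoint-* folder exists in the output directory, it simply calls `trainer.train(resume_from_checkpoint=True)`, with no LoRA-specific handling.

```python
import pathlib

# Paraphrased sketch of the resume logic in llava/train/train.py:
# if a checkpoint-* directory already exists in the output dir, the HF Trainer
# is asked to resume from it; there is no special branch for LoRA checkpoints.
if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
    trainer.train(resume_from_checkpoint=True)
else:
    trainer.train()
```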

zengxingchen commented 6 months ago
(Screenshot of the error attached, 2024-02-29.)
CynthiaChuang commented 6 months ago

I have the same issue. Can anyone tell me how to fix it?

qingyuanxingsi commented 6 months ago

+1

sunhm15 commented 5 months ago

+1

davidhalladay commented 5 months ago

I encountered this error while resuming a checkpoint from LoRA training. I found that it is basically due to the old version of Transformers that LLaVA uses. Please refer to this issue: https://github.com/huggingface/peft/issues/746.

As described there, the key names in a checkpoint saved via DeepSpeed do not match the ones saved via Transformers: an extra ".default." is inserted into each key of the non-trainable parameters, which leads to errors while loading the checkpoint.
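
To illustrate the mismatch (the key names below are hypothetical examples, and `state_dict` stands for whichever module state dict you load from the checkpoint), a manual rename along these lines would make the key sets line up, though upgrading the packages as described below is the cleaner fix:

```python
# Hypothetical illustration of the key-name mismatch described above: one set of
# keys carries an extra ".default." segment, e.g.
#   "...q_proj.lora_A.default.weight"   vs.   "...q_proj.lora_A.weight"
# so loading fails on those parameters. A manual workaround is to strip the
# extra segment so the two key sets line up (`state_dict` is assumed to be the
# module state dict loaded from the checkpoint).
cleaned_state_dict = {k.replace(".default.", "."): v for k, v in state_dict.items()}
```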

Here is a solution that I found. I have only tested it for LoRA training, where it works well; I haven't tested it with other features, so it may introduce further errors:

  1. This mismatch has been fixed in newer releases of the Transformers package, so update it: `pip install transformers==4.39.3`
  2. Then update Accelerate to a version compatible with that Transformers release: `pip install accelerate==0.27.2`

Again, this has only worked for me with LoRA training so far; I'm not sure whether it introduces other errors.

STARRY2001 commented 4 months ago

> (quoting @davidhalladay's suggestion above)

But I ran into some problems when running pip. How did you solve this? (screenshot attached)

Linjyan00 commented 4 months ago

> (quoting @STARRY2001's question above about the pip errors)

Just ignore it.

davidhalladay commented 4 months ago

On my end, this compatibility issue only causes errors during testing. Therefore, I maintain two separate conda environments: one for training (with transformers==4.39.3) and one for testing (with transformers==4.37.1). While this setup may seem redundant, it offers a quick solution to address the problem.

user074 commented 4 months ago

> (quoting @davidhalladay's suggestion above)

Thanks! This solved my issue. I had been trying to save and reload the LoRA checkpoints and had problems for a while.

wenyisir commented 4 months ago

I fixed this bug by modifying site-packages/deepspeed/runtime/engine.py: at line 2675, set load_module_strict=False.
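
For reference, the edit being described is roughly the following; the exact line number and signature depend on the installed DeepSpeed version, so treat this as a sketch to check against your own copy of engine.py:

```python
# site-packages/deepspeed/runtime/engine.py (sketch; location varies by version).
# Flipping the default of load_module_strict makes DeepSpeed load the module
# state dict non-strictly, so the mismatched LoRA key names are skipped instead
# of raising an error while resuming.
def load_checkpoint(self,
                    load_dir,
                    tag=None,
                    load_module_strict=False,  # changed from True
                    load_optimizer_states=True,
                    load_lr_scheduler_states=True,
                    load_module_only=False):
    ...
```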

tetsu-kikuchi commented 4 months ago

I am afraid that non_lora_trainables.bin will not be loaded just by setting trainer.train(resume_from_checkpoint=True), because non_lora_trainables.bin is a name specific to LLaVA and is outside the scope of Hugging Face. Could anyone clarify this point?

Added: It seems that non_lora_trainables.bin is not even saved at intermediate checkpoints (every args.save_steps iterations); it is saved only when the whole training schedule ends. In any case, I am afraid that non_lora_trainables.bin will not be loaded by the Hugging Face APIs either, including other approaches such as the one in #1027.

Maybe we have to insert code to load non_lora_trainables.bin in llava/train/train.py, just as is done, for example, in llava/eval/model_vqa.py. I would appreciate comments if I am misunderstanding.
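
For what it's worth, a minimal sketch of that kind of patch is below; it mirrors the key-renaming LLaVA applies when loading LoRA checkpoints for evaluation, and the checkpoint path and the `model` variable are assumptions standing in for whatever train.py has at that point:

```python
import os
import torch

# Sketch: manually merge non_lora_trainables.bin (mm_projector etc.) back into
# the model before resuming. `model` is assumed to be the already-built LLaVA
# model in train.py, and the checkpoint directory below is just an example path.
checkpoint_dir = "./checkpoints/llava-lora-finetune"  # example path
non_lora_path = os.path.join(checkpoint_dir, "non_lora_trainables.bin")

if os.path.exists(non_lora_path):
    non_lora_trainables = torch.load(non_lora_path, map_location="cpu")
    # Strip the prefixes so the keys line up with model.state_dict().
    non_lora_trainables = {
        (k[11:] if k.startswith("base_model.") else k): v
        for k, v in non_lora_trainables.items()
    }
    if any(k.startswith("model.model.") for k in non_lora_trainables):
        non_lora_trainables = {
            (k[6:] if k.startswith("model.") else k): v
            for k, v in non_lora_trainables.items()
        }
    model.load_state_dict(non_lora_trainables, strict=False)
```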