Sorry for the misleading message above. After careful investigation of the model weights and code, I found that the main issue is the reproduced model weights. I evaluated the reproduced weights on MM-Vet and the result is only 31.7. Could you kindly provide the trained weights for the 7B model, or the training logs, so that I can find the cause of the issue? I have already spent a lot of computational resources on this codebase and really hope the code can work. I would appreciate your swift response.
Here are the TensorBoard results for my reproduction; I hope this provides more information.
Hi, I think the training log you pasted looks normal. Could you please make sure that you have successfully loaded the LoRA weights? The results you posted look the same as the baseline LLaVA 1.5 7B model.
Unfortunately, I may not have the trained weights anymore, because the checkpoints were likely lost when the previous GPU rental period ended. However, I managed to find the result files of an experiment trial. A similar score can be obtained after uploading them to the MM-Vet evaluator.
Could you please check again whether the correct LoRA path is set and the LoRA weights are successfully loaded into the model? I find the results very strange. Also, could you share the detailed results on the MM-Vet benchmark?
Hi, thanks for the reply. I double-checked the code for loading the LoRA weights and found that they are loaded successfully. Here is the code I use to check the model weights:
```python
import torch


def compare_model_weights(model1, model2):
    """
    Compare the weights of two PyTorch models and print the delta for each layer.

    Args:
        model1: The first PyTorch model.
        model2: The second PyTorch model.
    """
    # Ensure both models have the same architecture
    assert len(list(model1.state_dict())) == len(list(model2.state_dict())), \
        "The models have different architectures or number of parameters."
    for (name1, param1), (name2, param2) in zip(model1.named_parameters(), model2.named_parameters()):
        assert name1 == name2, f"Parameter names do not match: {name1} != {name2}"
        # Compute the absolute difference between corresponding parameters
        delta = torch.abs(param1 - param2)
        # Compute the max and mean of the delta to summarize the difference
        max_delta = torch.max(delta).item()
        mean_delta = torch.mean(delta).item()
        raw_mean = torch.mean(param1).item()
        # Only report layers whose weights actually changed after merging LoRA
        if max_delta > 0:
            print(f"Layer: {name1} | Max Delta: {max_delta:.6f} | Mean Delta: {mean_delta:.6f} raw weight mean: {raw_mean:.6f}")
```
I use this code to evaluate the difference between the base model and the merged LoRA model, following the code in llava/model/builder.py:
```python
import copy

from peft import PeftModel

print('Loading LoRA weights...')
# Keep a copy of the base model to compare against after merging
model_cp = copy.deepcopy(model)
model = PeftModel.from_pretrained(model, model_path, device_map="cpu")
print('Merging LoRA weights...')
model = model.merge_and_unload()
# Print the per-layer deltas between the merged model and the original base model
compare_model_weights(model, model_cp)
```
I copy part of the results here, which shows that the LoRA weights are loaded:
```
Layer: model.layers.19.self_attn.k_proj.weight | Max Delta: 0.000916 | Mean Delta: 0.000046 raw weight mean: -0.000008
Layer: model.layers.19.self_attn.v_proj.weight | Max Delta: 0.001907 | Mean Delta: 0.000075 raw weight mean: -0.000002
Layer: model.layers.19.self_attn.o_proj.weight | Max Delta: 0.000687 | Mean Delta: 0.000059 raw weight mean: -0.000001
Layer: model.layers.19.mlp.gate_proj.weight | Max Delta: 0.000935 | Mean Delta: 0.000051 raw weight mean: -0.000056
Layer: model.layers.19.mlp.up_proj.weight | Max Delta: 0.000839 | Mean Delta: 0.000047 raw weight mean: -0.000005
Layer: model.layers.19.mlp.down_proj.weight | Max Delta: 0.000435 | Mean Delta: 0.000034 raw weight mean: -0.000002
Layer: model.layers.20.self_attn.q_proj.weight | Max Delta: 0.000633 | Mean Delta: 0.000042 raw weight mean: 0.000002
Layer: model.layers.20.self_attn.k_proj.weight | Max Delta: 0.000687 | Mean Delta: 0.000045 raw weight mean: 0.000004
Layer: model.layers.20.self_attn.v_proj.weight | Max Delta: 0.002796 | Mean Delta: 0.000078 raw weight mean: -0.000004
Layer: model.layers.20.self_attn.o_proj.weight | Max Delta: 0.000721 | Mean Delta: 0.000054 raw weight mean: 0.000001
Layer: model.layers.20.mlp.gate_proj.weight | Max Delta: 0.000954 | Mean Delta: 0.000048 raw weight mean: -0.000037
Layer: model.layers.20.mlp.up_proj.weight | Max Delta: 0.001007 | Mean Delta: 0.000046 raw weight mean: -0.000000
Layer: model.layers.20.mlp.down_proj.weight | Max Delta: 0.000477 | Mean Delta: 0.000032 raw weight mean: -0.000000
```
The detailed metrics for MM-Vet are here:
```
,rec_gen_know,rec,spat_ocr,spat_math_ocr,rec_spat,ocr,math_ocr,rec_know,rec_gen_know_ocr,rec_gen_ocr_spat,rec_ocr_spat,rec_ocr,spat_know_ocr,rec_know_spat,spat_gen_ocr,rec_math_ocr_spat,total,std,runs
bpo_bpo_821,17.1,73.0,24.6,10.7,55.8,37.5,1.0,22.2,22.5,48.7,14.3,50.0,16.7,50.0,10.0,0.0,31.7,0.0,[31.7]
```
Additionally, I evaluated the reproduced LoRA model on the MME benchmark, and the score is 1372, which is quite different from the original LLaVA 1.5 7B.
I'm wondering whether the choice of LoRA rank is important for the final results. The current reproduced checkpoint was trained with LoRA rank 32. I will try to reproduce again with rank 64 soon.
Another issue that worries me is the training time: I trained the rank-32 LoRA model on 8 A40 GPUs, and it took about 30 hours to complete for the 7B model, which is far from the 17 hours reported in the paper. If you have any idea about this issue, please let me know.
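For clarity, this is the kind of configuration I mean when I say "rank 32" vs "rank 64". It is only a minimal sketch using peft's `LoraConfig`; the target modules, alpha, and dropout values below are my own assumptions, not necessarily this codebase's settings:

```python
from peft import LoraConfig

# Illustrative only: `r` is the LoRA rank being discussed (32 now, 64 in the next trial).
# target_modules, lora_alpha, and lora_dropout are assumed values for the sketch.
lora_config = LoraConfig(
    r=32,                     # LoRA rank under test
    lora_alpha=64,            # scaling factor, often set to 2 * r
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```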
Thanks again for your precious time.
Hi, we are currently running the experiments again, and hopefully we will find the problem by this weekend. Could you also share the detailed MM-Vet scores you obtained? Thanks.
Hi, here is the prediction file for MM-Vet. I set the temperature to 0 to check whether the LoRA is successfully loaded: mmvet_normal_240821.json
The detailed MM-Vet scores are listed below:
```
rec: 37.3
ocr: 22.8
know: 18.9
gen: 20.6
spat: 28.3
math: 6.2
total: 31.7
std: 0.0
runs: [31.7]
```
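As a side note on the temperature-0 setting mentioned above: by temperature 0 I mean greedy decoding. A minimal sketch with a generic Hugging Face model is shown below; the model path is hypothetical and the actual evaluation script's flags may differ:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical path; substitute the merged LoRA checkpoint being evaluated.
model_path = "path/to/merged-llava-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

inputs = tokenizer("Describe the image.", return_tensors="pt")
# do_sample=False gives greedy decoding, i.e. the temperature-0 behaviour used here.
output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```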
Hi, the results look quite close to the original LLaVA, so I guess there might be something wrong with either the training or the evaluation. I will get back to you after I reproduce the results.
For the running time, please consider using the flash-attn version I just updated. There may be some discrepancies because the amount of data was updated at the end. However, training should finish in around 20 hours after applying flash-attn.
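For reference, one common way to enable FlashAttention-2 when loading a Hugging Face model is shown below; this is only a sketch with a hypothetical path, and the repository's updated code may wire this up differently:

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative only: the repository may enable flash attention through its own config.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/llava-7b-base",              # hypothetical checkpoint path
    torch_dtype=torch.bfloat16,           # flash-attn requires fp16/bf16 weights
    attn_implementation="flash_attention_2",
)
```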
Thanks.
Thanks for the update. I will try the flash attention when my computational resources are allocated.
Hi, I reran the experiment, this time with 4 A100-80G GPUs. Since I used different batch sizes, there might be a slight difference. Here are the results I got:
I uploaded the LoRA checkpoints at the following link: https://huggingface.co/renjiepi/BPO-Lora-LLaVA-7B.
However, I suppose there might be some instability during training that caused such a different result on your side. I suggest lowering the learning rate, or training the model for just 1 epoch and checking whether the performance improves; I suspect there might have been a collapse during your training trial.
I will run the experiments with different LoRA ranks and give you the results. Also, I will check whether the data uploaded to Hugging Face is the correct version.
Hi, I downloaded the checkpoint and evaluated it on my machine. The MM-Vet results can be reproduced, so I think the evaluation code is fine.
Regarding a collapse during training: I also trained the model with the POVID data, and the evaluation results are still similar to the original LLaVA 1.5 7B. I think there may be something wrong with the training code or my environment.
Additionally, I compared your training log with my reproduced version.
I find the reward metrics strange, while the other metrics are close. Also, my training seems more unstable. I will try flash attention to check whether it is the cause of the problem. Do you have any idea about this issue?
Hi, I just updated the environment file. Hopefully it is caused by the different package versions.
Thanks, I will reinstall the environment and train again.
Hi, I reinstalled the environment and successfully reproduced the results on my server. Thanks for your precious time.
By the way, there are some missing packages, such as:
```
pip install deepspeed
pip install numpy==1.26.2  # numpy 2.x is installed by default and causes an incompatibility error
```
That's great to know! I'll update the requirements in the project file. Thanks!
Hi, thanks for your work! I have reproduced the training based on this codebase, and the loss seems normal, as listed below.
However, when I try to evaluate the model, I encounter a lot of trouble loading it. Specifically, the following lines are hard to run:
There are a few issues:
I'm wondering if you could kindly double-check the peft version, because I think it may be a library version issue. Also, could you provide any information that may be helpful for solving these issues?
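For context, this is how I check the installed versions on my side; it is a generic check, not specific to this repository:

```python
# Quick check of the installed peft / transformers versions.
import peft
import transformers

print("peft:", peft.__version__)
print("transformers:", transformers.__version__)
```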