Sorry for the misleading message above. After careful investigation of the model weights and code, I found that the main issue is the reproduced model weights. I evaluated the reproduced weights on MM-Vet and the result is only 31.7. Could you kindly provide the trained weights for the 7B model, or the training logs, so that I can find the cause of the issue? I have already spent a lot of computational resources on this codebase and really hope the code can work. I would appreciate your swift response.
Here are the TensorBoard results for my reproduction; I hope this provides more information.
Hi, I think the training log you pasted looks normal. Could you please make sure that you have successfully loaded the LoRA weights? The results you posted look the same as the baseline LLaVA 1.5 7B model.
Unfortunately, I may not have the trained weights anymore, because the checkpoints were likely lost when the previous GPU rental period ended. However, I managed to find the result files of an experiment trial. A similar score can be obtained after uploading them to the MM-Vet evaluator.
Could you please check again whether the correct LoRA path is set and the LoRA weights are successfully loaded into the model? I find the results very strange. Also, could you share the detailed results on the MM-Vet benchmark?
Hi, thanks for the reply. I double-checked the code for loading the LoRA weights and found that they are loaded successfully. Here is the code I use to check the model weights:
```python
import torch


def compare_model_weights(model1, model2):
    """
    Compare the weights of two PyTorch models and print the delta for each layer.

    Args:
        model1: The first PyTorch model.
        model2: The second PyTorch model.
    """
    # Ensure both models have the same architecture
    assert len(list(model1.state_dict())) == len(list(model2.state_dict())), \
        "The models have different architectures or number of parameters."
    for (name1, param1), (name2, param2) in zip(model1.named_parameters(), model2.named_parameters()):
        assert name1 == name2, f"Parameter names do not match: {name1} != {name2}"
        # Compute the absolute difference between corresponding parameters
        delta = torch.abs(param1 - param2)
        # Compute the max and mean of the delta to summarize the difference
        max_delta = torch.max(delta).item()
        mean_delta = torch.mean(delta).item()
        raw_mean = torch.mean(param1).item()
        # Only report layers whose weights actually changed after merging LoRA
        if max_delta > 0:
            print(f"Layer: {name1} | Max Delta: {max_delta:.6f} | Mean Delta: {mean_delta:.6f} raw weight mean: {raw_mean:.6f}")
```
I use this code to evaluate the difference between the base model and the merged LoRA model, following the code in llava/model/builder.py:
```python
import copy

from peft import PeftModel

print('Loading LoRA weights...')
# Keep a copy of the base model to compare against after merging
model_cp = copy.deepcopy(model)
model = PeftModel.from_pretrained(model, model_path, device_map="cpu")
print('Merging LoRA weights...')
model = model.merge_and_unload()
# Print the per-layer deltas between the merged model and the original base model
compare_model_weights(model, model_cp)
```
I copy part of the results here, which shows that the LoRA weights are loaded:
```
Layer: model.layers.19.self_attn.k_proj.weight | Max Delta: 0.000916 | Mean Delta: 0.000046 raw weight mean: -0.000008
Layer: model.layers.19.self_attn.v_proj.weight | Max Delta: 0.001907 | Mean Delta: 0.000075 raw weight mean: -0.000002
Layer: model.layers.19.self_attn.o_proj.weight | Max Delta: 0.000687 | Mean Delta: 0.000059 raw weight mean: -0.000001
Layer: model.layers.19.mlp.gate_proj.weight | Max Delta: 0.000935 | Mean Delta: 0.000051 raw weight mean: -0.000056
Layer: model.layers.19.mlp.up_proj.weight | Max Delta: 0.000839 | Mean Delta: 0.000047 raw weight mean: -0.000005
Layer: model.layers.19.mlp.down_proj.weight | Max Delta: 0.000435 | Mean Delta: 0.000034 raw weight mean: -0.000002
Layer: model.layers.20.self_attn.q_proj.weight | Max Delta: 0.000633 | Mean Delta: 0.000042 raw weight mean: 0.000002
Layer: model.layers.20.self_attn.k_proj.weight | Max Delta: 0.000687 | Mean Delta: 0.000045 raw weight mean: 0.000004
Layer: model.layers.20.self_attn.v_proj.weight | Max Delta: 0.002796 | Mean Delta: 0.000078 raw weight mean: -0.000004
Layer: model.layers.20.self_attn.o_proj.weight | Max Delta: 0.000721 | Mean Delta: 0.000054 raw weight mean: 0.000001
Layer: model.layers.20.mlp.gate_proj.weight | Max Delta: 0.000954 | Mean Delta: 0.000048 raw weight mean: -0.000037
Layer: model.layers.20.mlp.up_proj.weight | Max Delta: 0.001007 | Mean Delta: 0.000046 raw weight mean: -0.000000
Layer: model.layers.20.mlp.down_proj.weight | Max Delta: 0.000477 | Mean Delta: 0.000032 raw weight mean: -0.000000
```
The detailed metrics for MM-Vet are here:
```
,rec_gen_know,rec,spat_ocr,spat_math_ocr,rec_spat,ocr,math_ocr,rec_know,rec_gen_know_ocr,rec_gen_ocr_spat,rec_ocr_spat,rec_ocr,spat_know_ocr,rec_know_spat,spat_gen_ocr,rec_math_ocr_spat,total,std,runs
bpo_bpo_821,17.1,73.0,24.6,10.7,55.8,37.5,1.0,22.2,22.5,48.7,14.3,50.0,16.7,50.0,10.0,0.0,31.7,0.0,[31.7]
```
Additionally, I evaluated the reproduced LoRA model on the MME benchmark, and the score is 1372, which is quite different from the original LLaVA 1.5 7B.
I'm wondering whether the choice of LoRA rank is important for the final results. The current reproduced checkpoint was trained with LoRA rank 32. I will try to reproduce again with rank 64 soon.
Another issue that worries me is the training time: I trained the rank-32 LoRA model on 8 A40 GPUs, and it took about 30 hours to complete for the 7B model, which is far from the 17 hours reported in the paper. If you have any idea about this issue, please let me know.
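For clarity, this is the kind of configuration I mean when I say "rank 32" vs "rank 64". It is only a minimal sketch using peft's `LoraConfig`; the target modules, alpha, and dropout values below are my own assumptions, not necessarily this codebase's settings:

```python
from peft import LoraConfig

# Illustrative only: `r` is the LoRA rank being discussed (32 now, 64 in the next trial).
# target_modules, lora_alpha, and lora_dropout are assumed values for the sketch.
lora_config = LoraConfig(
    r=32,                     # LoRA rank under test
    lora_alpha=64,            # scaling factor, often set to 2 * r
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```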
Thanks again for your precious time.
Hi, we are currently running the experiments again, and hopefully we will find the problem by this weekend. Could you also share the detailed MM-Vet scores you obtained? Thanks.
Hi, here is the prediction file for MM-Vet. I set the temperature to 0 to check whether the LoRA is successfully loaded: mmvet_normal_240821.json
The detailed MM-Vet scores are listed below:
```
rec: 37.3
ocr: 22.8
know: 18.9
gen: 20.6
spat: 28.3
math: 6.2
total: 31.7
std: 0.0
runs: [31.7]
```
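As a side note on the temperature-0 setting mentioned above: by temperature 0 I mean greedy decoding. A minimal sketch with a generic Hugging Face model is shown below; the model path is hypothetical and the actual evaluation script's flags may differ:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical path; substitute the merged LoRA checkpoint being evaluated.
model_path = "path/to/merged-llava-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

inputs = tokenizer("Describe the image.", return_tensors="pt")
# do_sample=False gives greedy decoding, i.e. the temperature-0 behaviour used here.
output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```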
Hi, the results look quite close to the original LLaVA, so I guess there might be something wrong with either the training or the evaluation. I will get back to you after I reproduce the results.
For the running time, please consider using the flash-attn version I just updated. There may be some discrepancies because the amount of data was updated at the end. However, training should finish in around 20 hours after applying flash-attn.
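For reference, one common way to enable FlashAttention-2 when loading a Hugging Face model is shown below; this is only a sketch with a hypothetical path, and the repository's updated code may wire this up differently:

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative only: the repository may enable flash attention through its own config.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/llava-7b-base",              # hypothetical checkpoint path
    torch_dtype=torch.bfloat16,           # flash-attn requires fp16/bf16 weights
    attn_implementation="flash_attention_2",
)
```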
Thanks.
Thanks for the update. I will try the flash attention when my computational resources are allocated.
Hi, I reran the experiment, this time with 4 A100-80G GPUs. Since I used different batch sizes, there might be a slight difference. Here are the results I got:
I uploaded the LoRA checkpoints at the following link: https://huggingface.co/renjiepi/BPO-Lora-LLaVA-7B.
However, I suppose there might be some instability during training that caused such a different result on your side. I suggest lowering the learning rate, or training the model for just 1 epoch and checking whether the performance improves; I suspect there might have been a collapse during your training trial.
I will run the experiments with different LoRA ranks and give you the results. Also, I will check whether the data uploaded to Hugging Face is the correct version.
Hi, I downloaded the checkpoint and evaluated it on my machine. The MM-Vet results can be reproduced, so I think the evaluation code is fine.
Regarding a collapse during training: I also trained the model with the POVID data, and the evaluation results are still similar to the original LLaVA 1.5 7B. I think there may be something wrong with the training code or my environment.
Additionally, I compared your training log with my reproduced version.
I find the reward metrics strange, while the other metrics are close. Also, my training seems more unstable. I will try flash attention to check whether it is the cause of the problem. Do you have any idea about this issue?
Hi, I just updated the environment file. Hopefully it is caused by the different package versions.
Thanks, I will reinstall the environment and train again.
Hi, I reinstalled the environment and successfully reproduced the results on my server. Thanks for your precious time.
By the way, there are some missing packages, such as:
```
pip install deepspeed
pip install numpy==1.26.2  # numpy 2.x is installed by default and causes an incompatibility error
```
That's great to know! I'll update the requirements in the project file. Thanks!
Hi, thanks for your work! I have reproduced the training based on this codebase, and the loss seems normal, as listed below.
However, when I try to evaluate the model, I encounter a lot of trouble loading it. Specifically, the following lines are hard to run:
There are a few issues:
I'm wondering if you could kindly double-check the peft version, because I think it may be a library version issue. Also, could you provide any information that may be helpful for solving these issues?
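For context, this is how I check the installed versions on my side; it is a generic check, not specific to this repository:

```python
# Quick check of the installed peft / transformers versions.
import peft
import transformers

print("peft:", peft.__version__)
print("transformers:", transformers.__version__)
```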