gregor-ge / mBLIP


Problem with Loading Weights #12

Open Alexwe12 opened 8 months ago

Alexwe12 commented 8 months ago

Hi, thank you for your work. I want to further instruction-tune the mBLIP mt0 model on my own data. I have set the blip_pretrained_checkpoint argument to the pytorch_model-00001-of-00002.bin file from the mBLIP repository, and the lm_pretrained argument to the bin files in the mBLIP repository that correspond to the encoder and decoder, together with the config.json and pytorch_model.bin.index.json files from the bigscience/mt0-xl repository. However, when I instantiate the mBLIP class and load the weights, I get an error at this line: https://github.com/gregor-ge/mBLIP/blob/f804c1f8bb84b13795b71aaaa6fe3f44851c908b/src/modules/modeling/mblip.py#L185. The process runs without issues when I set load_in_8bit=False, but then model.generate(**inputs) produces nonsensical output. Could you please advise on how to train the model starting from your provided checkpoint? Thank you very much.
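For reference, this is roughly my setup (paths are abbreviated and the argument names are the ones from the repository config, so treat this as a sketch rather than my literal code):

    from src.modules.modeling.mblip import mBLIP  # class from this repository

    # Sketch of the setup described above; paths are placeholders.
    model = mBLIP(
        blip_pretrained_checkpoint="mblip-mt0-xl/pytorch_model-00001-of-00002.bin",
        lm_pretrained="/path/to/local/mt0-xl",  # bin files + config.json + pytorch_model.bin.index.json
        load_in_8bit=True,                      # fails here; with False it loads but generates gibberish
    )
    # outputs = model.generate(**inputs)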

gregor-ge commented 8 months ago

Hi,

thanks for your interest in the project. Further training from my checkpoints is indeed tricky, so I understand your problem. Unfortunately, HuggingFace does not yet support loading only parts of a model in 8-bit, which makes the whole split into the LLM and the Vision+Q-Former part necessary.

I tried to write some brief instructions on how to do what you want, but I realized that it is actually quite hard, so I updated the code instead.

First of all, your process is nearly correct. The problem is that loading mt0 fails because 1) the checkpoint keys do not match what mt0 expects (they carry a "language_model." prefix) and 2) the index.json in mt0 is different from the index in my repository. Fixing those two things is possible but annoying, and the mismatch would explain why it generates gibberish.
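For completeness, if someone wants to do the manual fix anyway, it would look roughly like this (untested sketch; the shard file names are the ones from my repository, and it assumes mt0 loads with the standard MT5 classes):

    import torch
    from transformers import AutoConfig, MT5ForConditionalGeneration  # mt0 uses the MT5 architecture

    # Merge both mBLIP shards; the language model weights can be spread across them.
    state_dict = {}
    for shard in ["pytorch_model-00001-of-00002.bin", "pytorch_model-00002-of-00002.bin"]:
        state_dict.update(torch.load(shard, map_location="cpu"))

    # 1) Strip the "language_model." prefix so the keys match what mt0 expects.
    lm_state = {
        k[len("language_model."):]: v
        for k, v in state_dict.items()
        if k.startswith("language_model.")
    }

    # 2) Build mt0 from its config and load the weights directly, instead of relying
    #    on the (mismatched) pytorch_model.bin.index.json.
    config = AutoConfig.from_pretrained("bigscience/mt0-xl")
    lm = MT5ForConditionalGeneration(config)
    missing, unexpected = lm.load_state_dict(lm_state, strict=False)
    print(missing, unexpected)  # sanity check: both lists should be empty or near-empty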

With the commit I just pushed, you can set lm_pretrained=Gregor/mblip-mt0-xl and it should work; blip_pretrained_checkpoint stays pytorch_model-00001-of-00002.bin. Note that I only briefly tested whether a model loaded this way still produces something correct with generate, so let me know if you run into other problems.
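If you just want a quick sanity check that generation is sensible, the released checkpoint should also load with the standard HuggingFace BLIP-2 classes; a sketch (the image URL is just an example):

    import requests
    from PIL import Image
    from transformers import Blip2Processor, Blip2ForConditionalGeneration

    processor = Blip2Processor.from_pretrained("Gregor/mblip-mt0-xl")
    model = Blip2ForConditionalGeneration.from_pretrained("Gregor/mblip-mt0-xl")

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
    image = Image.open(requests.get(url, stream=True).raw)
    inputs = processor(images=image, text="Describe the image in English.", return_tensors="pt")

    generated = model.generate(**inputs, max_new_tokens=30)
    print(processor.batch_decode(generated, skip_special_tokens=True))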

Also, next time you run into an issue, please include the full log and stack trace of the error.

Alexwe12 commented 8 months ago

Hello, thank you very much for your response. I now run into another error when saving checkpoints: AttributeError: "'GroupedOptimizerTridentModule' object has no attribute 'lightning_module'". It occurs when I run the instruction-tuning experiment; the warm-up is okay. Thank you. run_jan.log

gregor-ge commented 8 months ago

If you scroll up in your stack trace, you will see that the actual error is:

Traceback (most recent call last):
  File "/home/mBLIP/src/tasks/vllm/checkpoint.py", line 50, in _save_checkpoint
    model = trainer.model.model.model
  File "/home/benenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'model'

Unfortunately, I cannot replicate this problem. When I start a training run with ddp, trainer.model.module.lightning_module.model.model in line 52 yields the mBLIP object, as it should.

I suggest stopping there with a debugger (you can set trainer.limit_train_batches: 10 and trainer.limit_val_batches: 10 to get there quicker) and checking which attribute chain yields the expected object.
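If the wrapping differs between our setups, one defensive option is to unwrap whatever is there instead of hard-coding the chain; a sketch (not what the repository currently does), assuming the callback has access to the trainer:

    from torch.nn.parallel import DistributedDataParallel

    def get_mblip(trainer):
        # Prefer Lightning's own accessor if available: it returns the
        # unwrapped LightningModule (GroupedOptimizerTridentModule here).
        module = getattr(trainer, "lightning_module", None) or trainer.model
        # Under DDP, trainer.model is a DistributedDataParallel wrapper; unwrap it.
        if isinstance(module, DistributedDataParallel):
            module = module.module
        # Some strategies add another indirection named .lightning_module.
        if hasattr(module, "lightning_module"):
            module = module.lightning_module
        # In this repository's wrapping, module.model.model is the mBLIP object.
        return module.model.model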

The Lightning version might also be a factor; I use 2.0.1.

Alternatively, you can remove ddp (if you use only one GPU) and see if that works.
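Roughly, that means forcing a single-device run so that trainer.model is the LightningModule itself rather than a DDP wrapper; how you set this depends on your config, but with a plain Trainer it would be something like:

    import lightning.pytorch as pl  # or pytorch_lightning, depending on your install

    # Single GPU, no DDP wrapper around the module.
    trainer = pl.Trainer(accelerator="gpu", devices=1)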