kazuar / Phi3-Vision-ft

Finetuning script for Phi-3-vision, the strong multimodal language model by Microsoft.
Apache License 2.0

RuntimeError: The weights trying to be saved contained shared tensors [{'model.vision_embed_tokens.wte.weight', 'model.embed_tokens.weight'}] #2

Open · kazuar opened this issue 3 months ago

kazuar commented 3 months ago

When running the finetune.sh script with my own dataset, I encountered the following error while saving the model checkpoint:

RuntimeError: The weights trying to be saved contained shared tensors [{'model.vision_embed_tokens.wte.weight', 
'model.embed_tokens.weight'}] that are mismatching the transformers base configuration.
Try saving using `safe_serialization=False` or remove this tensor sharing.

Setting safe_serialization=False resulted in a model that couldn't be loaded. @2U1 did you encounter this error? (I opened it here because https://github.com/2U1/Phi3-Vision-ft doesn't have an issues tab.)
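For context, the error comes from safetensors refusing to serialize two state-dict entries that alias the same storage: in Phi-3-vision the vision token embedding is tied to the text embedding. A minimal pure-PyTorch sketch (toy module, names hypothetical) showing how such tied weights look and how cloning one entry breaks the aliasing:

```python
import torch

class TinyModel(torch.nn.Module):
    """Toy model with tied embeddings, mimicking the shared
    model.embed_tokens.weight / model.vision_embed_tokens.wte.weight pair."""
    def __init__(self):
        super().__init__()
        self.embed_tokens = torch.nn.Embedding(8, 4)
        self.wte = torch.nn.Embedding(8, 4)
        self.wte.weight = self.embed_tokens.weight  # tie the weights

model = TinyModel()
state = model.state_dict()

# Both entries point at the same storage, which safetensors rejects.
assert state["embed_tokens.weight"].data_ptr() == state["wte.weight"].data_ptr()

# Cloning one entry breaks the aliasing, so the dict can be saved safely.
state["wte.weight"] = state["wte.weight"].clone()
```

After the clone the two entries hold equal values but separate storage, which is exactly the "remove this tensor sharing" option the error message suggests.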

2U1 commented 3 months ago

First, sorry, I didn't realize the issues tab was disabled. I've opened it now.

Did the error occur during full fine-tuning? I'm not seeing it on my end, but it might be an issue caused by the accelerator.
Can you add

from accelerate import Accelerator
a = Accelerator()
a.save_model(trainer.model, output_dir)
trainer.model.config.save_pretrained(output_dir)

in safe_save_model_for_hf_trainer instead of trainer.save_model(output_dir)?

I'll keep working on this issue.

kazuar commented 3 months ago

> Did the error occur during full fine-tuning?

Yes, I'm running the finetune.sh script with my own data (I only changed num_train_epochs to 4).

> I'm not seeing it on my end, but it might be an issue caused by the accelerator.

I'll restart the environment tomorrow and try again. Seems like a good idea!

@2U1 thanks for all the help!

2U1 commented 3 months ago

Another option is to comment out the DeepSpeed branch in the safe_save_model_for_hf_trainer function, like this. I couldn't run full fine-tuning because of a GPU issue, so I can't verify the fix right now. Let me know if either of these approaches solves it.

# if trainer.deepspeed:
#     from accelerate import Accelerator
#     accelerator = Accelerator()
#     accelerator.wait_for_everyone()
#     torch.cuda.synchronize()
#     # trainer.save_model(output_dir)
#     accelerator.save(trainer.model, output_dir, max_shard_size = '5GB')
#     trainer.model.config.save_pretrained(output_dir)
#     trainer.processor.save_pretrained(output_dir)
#     return

state_dict = trainer.model.state_dict()
if trainer.args.should_save:
    # Move every tensor to CPU before handing the state dict to the saver.
    cpu_state_dict = {
        key: value.cpu()
        for key, value in state_dict.items()
    }
    del state_dict
    trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa
    trainer.model.config.save_pretrained(output_dir)
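As a footnote on why routing the save through Accelerator can sidestep the error: as I understand it, accelerate deduplicates state-dict entries that alias the same storage before writing safetensors. A rough, self-contained sketch of that deduplication idea (function name hypothetical, not the repo's or accelerate's actual code):

```python
import torch

def dedupe_shared_tensors(state_dict):
    """Return a copy of state_dict in which every tensor owns its storage.

    Entries that alias an earlier tensor's memory are cloned, so a
    safetensors-style saver no longer sees shared tensors.
    """
    seen_ptrs = set()
    out = {}
    for name, tensor in state_dict.items():
        if tensor.data_ptr() in seen_ptrs:
            out[name] = tensor.clone()  # break the aliasing with a copy
        else:
            seen_ptrs.add(tensor.data_ptr())
            out[name] = tensor
    return out

# Tied pair: both names reference one tensor, as in the reported error.
w = torch.randn(8, 4)
tied = {
    "model.embed_tokens.weight": w,
    "model.vision_embed_tokens.wte.weight": w,
}
clean = dedupe_shared_tensors(tied)
```

The cost is one extra copy per tied tensor at save time; loading is unaffected, although re-tying the weights after load is then the loader's responsibility.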