huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Saving Phi 3 vision fails due to tensor sharing #32354

Closed EricLBuehler closed 2 months ago

EricLBuehler commented 5 months ago

Hello and thank you for the great work here!

We are trying to save a Phi 3 vision model, but are running into some issues saving it as safetensors.

Due to a shared weight, saving unfortunately fails when using safetensors. Is there a way to de-tie the weight? I attempted to de-tie it manually by copying the tensor, but that did not work (perhaps I did it incorrectly, or there is another reference to it?).

Minimal reproducible example

from transformers import AutoModelForCausalLM

model_id = "lamm-mit/Cephalo-Phi-3-vision-128k-4b-alpha"

# trust_remote_code is needed because Phi-3-vision ships its own modeling code
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto"
)

# Fails with a shared-tensor RuntimeError when safe_serialization=True
model.save_pretrained("out", safe_serialization=True)

Narsil commented 4 months ago

@EricLBuehler Thanks for the report.

https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/tree/main seems to be working fine, so there must be something going on in that specific remote code.

Without a stack trace it's kind of hard to debug though.

I moved the issue to transformers since it seems the error is there; with a stack trace and error message I might be able to help better.

Also, weight sharing is not allowed in safetensors files. If you're lazy, you can just call contiguous() on every member of your state dict, but that might have consequences downstream, for fine-tuning for instance. Some types of weight sharing are simply not allowed (like overlapping tensors where neither is a superset of the other).
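
For example, a rough sketch of that workaround (untested; it reuses model from the snippet above and relies on the state_dict argument of save_pretrained; clone() is added on top because contiguous() is a no-op on tensors that are already contiguous):

# Untested sketch: copy every tensor in the state dict so no storage is shared,
# then hand the de-shared state dict to save_pretrained.
state_dict = {k: v.clone().contiguous() for k, v in model.state_dict().items()}
model.save_pretrained("out", state_dict=state_dict, safe_serialization=True)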

nmoeller commented 3 months ago

I am running into the same problem when using Phi-3-Vision from Hugging Face. Our current goal is to convert a LoRA-finetuned Phi-3-Vision to ONNX and run inference with onnxruntime-genai.

Some context on how I got here:

  1. We trained a LoRA adapter for Phi-3-Vision.
  2. I tried to merge the LoRA adapter and save the model with the adjusted LoRA weights.
  3. I kept running into the error posted below.
  4. I removed the LoRA adapter loading and tried to save only the base Phi-3-Vision, and got the same error.

The script to reproduce:

from transformers import AutoModelForCausalLM

model_id = "microsoft/Phi-3-vision-128k-instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto"
)

model.save_pretrained("out", safe_serialization=True)

The error message (I do not want to set safe_serialization=False):

Execution failed. User process 'Rank 0' exited with status code 1. Please check log file 'user_logs/std_log_process_0.txt' for error details. Error: Traceback (most recent call last):
  File "save_phi3_vision.py", line 9, in <module>
    model.save_pretrained("out", safe_serialization=True)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2546, in save_pretrained
    raise RuntimeError(
RuntimeError: The weights trying to be saved contained shared tensors [{'model.embed_tokens.weight', 'model.vision_embed_tokens.wte.weight'}] that are mismatching the transformers base configuration. Try saving using `safe_serialization=False` or remove this tensor sharing.
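
For reference, the sharing can be confirmed directly (a quick check, using the two state dict keys named in the error):

sd = model.state_dict()
a = sd["model.embed_tokens.weight"]
b = sd["model.vision_embed_tokens.wte.weight"]
# True means both entries alias the same underlying memory
print(a.data_ptr() == b.data_ptr())
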
ArthurZucker commented 3 months ago

Hey! I am seeing trust_remote_code=True, which means the code being used is potentially not well supported. Try cloning the weights of the embed tokens into model.vision_embed_tokens.wte manually; that could help.
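
A minimal sketch of that suggestion (untested; the attribute path is inferred from the state dict keys in the error message above, so it may differ in the remote modeling code):

import torch

# Untested: replace the tied vision embedding weight with an independent copy.
# 'model.model.vision_embed_tokens.wte' mirrors the key in the error message;
# adjust if the remote code names the module differently.
wte = model.model.vision_embed_tokens.wte
wte.weight = torch.nn.Parameter(wte.weight.detach().clone())

# With the sharing broken, safetensors serialization should no longer complain.
model.save_pretrained("out", safe_serialization=True)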

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.