huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
135.4k stars, 27.09k forks

Saving model in safetensors format through Trainer fails for Gemma 2 due to shared tensors #33807

Open oranshayer opened 1 month ago

oranshayer commented 1 month ago

System Info

Who can help?

@muellerzr @SunMarc

Information

Tasks

Reproduction

I am fine-tuning google/gemma-2-2b. These are the training arguments and the Trainer call:

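from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# `token`, `args`, the datasets, and `compute_metrics` are defined elsewhere in the script.
text_model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", token=token, attn_implementation='eager')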

training_args = TrainingArguments(
    output_dir=args.log_dir,
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    warmup_steps=args.warmup_steps,
    learning_rate=args.learning_rate,
    evaluation_strategy="no",
    logging_dir=args.log_dir,
    logging_steps=50,
    save_strategy="steps",
    save_steps=2000,
    report_to="mlflow",
    run_name=args.run_name,
)

trainer = Trainer(
    model=text_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

I get the following error when the Trainer tries to save the model:

RuntimeError: 
            Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'text_model.model.embed_tokens.weight', 'text_model.lm_head.weight'}].
            A potential way to correctly save your model is to use `save_model`.
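
The error lists groups of state-dict names that point at the same memory. As a rough sketch of how such groups can be found, the snippet below groups names by the identity of their underlying buffer. `storage_id` and `find_shared_groups` are hypothetical helper names, not part of transformers or safetensors, and the parameter names are copied from the error message for illustration only. For real PyTorch tensors the helper relies on `data_ptr()`, which would miss views that overlap at different offsets; plain Python lists stand in for tensors here so that the example runs without any dependency.

```python
from collections import defaultdict


def storage_id(tensor):
    """Identify the underlying buffer of a tensor-like object.

    Uses data_ptr() when the object has one (as torch tensors do) and
    falls back to id() for plain Python objects.
    """
    data_ptr = getattr(tensor, "data_ptr", None)
    return data_ptr() if callable(data_ptr) else id(tensor)


def find_shared_groups(state_dict):
    """Return the sets of names whose values share the same buffer."""
    groups = defaultdict(set)
    for name, tensor in state_dict.items():
        groups[storage_id(tensor)].add(name)
    return [names for names in groups.values() if len(names) > 1]


# Stand-in state dict: the embedding and the output head are the same object,
# which mimics tied weights. The third entry has its own buffer.
tied_weight = [0.0, 0.0, 0.0, 0.0]
state_dict = {
    "text_model.model.embed_tokens.weight": tied_weight,
    "text_model.lm_head.weight": tied_weight,
    "text_model.model.norm.weight": [1.0],
}

groups = find_shared_groups(state_dict)
print(groups)
```

Running the same helper on `model.state_dict()` before training would, under the assumptions above, show whether the tied embedding and output head are the only shared pair.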

As a workaround, I have disabled saving as safetensors by setting save_safetensors=False in the training arguments.

Expected behavior

The model should be saved in safetensors format without raising an error.

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Rocketknight1 commented 3 weeks ago

cc @muellerzr @SunMarc !

SunMarc commented 2 weeks ago

Hey @oranshayer, could you share a minimal reproducer? This shouldn't happen, as we make sure to remove the shared tensors prior to saving. Having the entire traceback would also help us figure out where the problem is. Thanks!
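
To illustrate what removing shared tensors prior to saving can look like, here is a minimal sketch that keeps one name per shared buffer and drops the rest. It is not the code transformers actually runs; `drop_shared_duplicates` is a hypothetical helper, the choice to keep the alphabetically first name is an arbitrary assumption, and plain Python lists stand in for tensors so the example has no dependencies.

```python
def drop_shared_duplicates(state_dict):
    """Return a copy of state_dict in which each buffer appears under one name.

    Names are visited in sorted order, so the alphabetically first name of
    each shared group is the one that is kept (an arbitrary choice).
    """
    seen_buffers = set()
    deduplicated = {}
    for name in sorted(state_dict):
        tensor = state_dict[name]
        # data_ptr() exists on torch tensors; id() is the fallback for stand-ins.
        data_ptr = getattr(tensor, "data_ptr", None)
        buffer_key = data_ptr() if callable(data_ptr) else id(tensor)
        if buffer_key in seen_buffers:
            continue  # another name already covers this buffer
        seen_buffers.add(buffer_key)
        deduplicated[name] = tensor
    return deduplicated


# Stand-in for a state dict with tied embedding and output weights.
tied_weight = [0.0, 0.0, 0.0, 0.0]
state_dict = {
    "text_model.model.embed_tokens.weight": tied_weight,
    "text_model.lm_head.weight": tied_weight,
    "text_model.model.norm.weight": [1.0],
}

deduplicated = drop_shared_duplicates(state_dict)
print(sorted(deduplicated))
```

A state dict reduced this way only loads correctly if the model ties the dropped name back to the kept one when it is loaded, so this is an illustration of the idea rather than a drop-in fix.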