Open kovern opened 1 month ago
Oh actually I think this is a known issue - I just didn't have time to fix it sorry :( I'll flag this as an issue though
Thank you very much :) Is there a timeline for when this might happen?
And another important question: can you please confirm that save_pretrained_merged saves the correct model (even if it's not a LoRA training)? If so, I can continue the training; but if there is no way to save the trained model correctly, it would be a complete waste of time and money to continue until the fix lands.
+1, I see the same problem. Thank you in advance for the fix!
Btw, may I know roughly when it will be fixed? This is blocking training right now :(
OOHH wait, I get it - I was so confused about what the issue was. Llama 3.2 has tied lm_head and embed_tokens weights, hence the bug - if you're finetuning the lm_head and embed_tokens, you will hit this issue.
save_pretrained_merged will get you the correct results as well, but yes, this is an issue for tied models - I might work on something to help alleviate it.
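For reference, a quick way to confirm the tying on a loaded model (a minimal sketch; the checkpoint name is just an example):
from unsloth import FastLanguageModel

# Example checkpoint only - any Llama 3.2 1B/3B model has tie_word_embeddings = True.
model, tokenizer = FastLanguageModel.from_pretrained("unsloth/Llama-3.2-3B-Instruct", load_in_4bit = False)

print(model.config.tie_word_embeddings)  # True -> lm_head is tied to embed_tokens
# The two weights are literally the same storage, which is what safetensors complains about:
print(model.get_input_embeddings().weight.data_ptr() == model.get_output_embeddings().weight.data_ptr())  # True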
Thank you! I guess weight tying is the problem.
Thanks! Is there any workaround for automatically saving the model during training?
@kovern I use safe_serialization=False, but I don't know whether this will introduce other bugs. In other words, I'm not sure whether it avoids the exception because it really understands weight tying, or simply because it doesn't check for it.
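Concretely, that just means saving without safetensors (a minimal sketch; model and tokenizer come from the training script, and the directory name is a placeholder):
# Skip the safetensors format so the shared lm_head / embed_tokens storage
# does not trip the duplicate-memory check; this writes pytorch_model.bin instead.
model.save_pretrained("checkpoint_dir", safe_serialization = False)
tokenizer.save_pretrained("checkpoint_dir")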
@fzyzcjy in the TrainingArguments there is no safe_serialization, but now that I've had a closer look I've actually found something interesting: save_safetensors.
Tomorrow I'll try that; I hope it works well with save_strategy = "steps".
Oh yes, I mean save_safetensors=False for TrainingArguments (safe_serialization is for other functions, IIRC).
I do have save_strategy='steps', save_safetensors=False, and it works well.
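Roughly like this (a minimal sketch; only the save-related fields matter, the rest are placeholders, and UnslothTrainingArguments should accept the same fields since it mirrors TrainingArguments):
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir = "outputs",
    save_strategy = "steps",
    save_steps = 1000,
    save_safetensors = False,  # write .bin checkpoints instead of .safetensors,
                               # so the tied lm_head/embed_tokens weights can be saved
    per_device_train_batch_size = 2,
    num_train_epochs = 1,
)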
Hello, I tried to train Llama 3.2 3B. It's a full finetune, not a LoRA, but Unsloth always crashes, under varying conditions, when the model should be saved. The hardware was RunPod in all cases, with different configurations (H100, A100, RTX 6000), and the latest Unsloth and latest Transformers.
Case 1. Autosave every N steps. I set save_strategy = "steps" and save_steps = 1000 in UnslothTrainingArguments. At the 1000th step this happened:
Traceback (most recent call last):
File "train.py", line 90, in <module>
trainer_stats = trainer.train()
^^^^^^^^^^^^^^^
File "", line 142, in train
File "", line 440, in _fast_inner_training_loop
File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2807, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2886, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3454, in save_model
self._save(output_dir)
File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3525, in _save
self.model.save_pretrained(
File "/usr/local/lib/python3.11/dist-packages/transformers/modeling_utils.py", line 2793, in save_pretrained
safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
File "/usr/local/lib/python3.11/dist-packages/safetensors/torch.py", line 286, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/safetensors/torch.py", line 488, in _flatten
raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'lm_head.weight', 'model.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model. More information at https://huggingface.co/docs/safetensors/torch_shared_tensors

But unfortunately the problem is not the magic number 1000: Unsloth crashes with every other number as well.
Case 2. No autosave at any step. I tried a lot of things, like deleting the config values from UnslothTrainingArguments, or setting a step count greater than the total length of the training. But at the end of the training Unsloth began to autosave the model, and suddenly the same error appeared:
Traceback (most recent call last):
File "train.py", line 90, in <module>
trainer_stats = trainer.train()
^^^^^^^^^^^^^^^
File "", line 142, in train
File "", line 440, in _fast_inner_training_loop
File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2807, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2886, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3454, in save_model
self._save(output_dir)
File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3525, in _save
self.model.save_pretrained(
File "/usr/local/lib/python3.11/dist-packages/transformers/modeling_utils.py", line 2793, in save_pretrained
safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
File "/usr/local/lib/python3.11/dist-packages/safetensors/torch.py", line 286, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/safetensors/torch.py", line 488, in _flatten
raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'lm_head.weight', 'model.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model. More information at https://huggingface.co/docs/safetensors/torch_shared_tensors

Case 3. Save at the end. It seemed the only solution was to disable the autosave completely, which is not a wise decision, but I had no other choice. Lo and behold, the training went through, but at the end I got the same error when I tried to save the model manually:
model.save_pretrained('new_model') --> ERROR
trainer.model.save_pretrained('new_model') --> ERROR
Case 4. The only way that worked. Finally I tried this, and it worked, but I'm not at all sure it's the right approach:
model.save_pretrained_merged('new_model', tokenizer, save_method = "merged_16bit")
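For what it's worth, the merged folder can be sanity-checked by loading it back with plain Transformers (a minimal sketch; 'new_model' is the directory saved above):
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the merged 16-bit checkpoint and run a tiny generation as a smoke test.
tok = AutoTokenizer.from_pretrained("new_model")
m = AutoModelForCausalLM.from_pretrained("new_model", torch_dtype = "auto", device_map = "auto")
inputs = tok("Hello", return_tensors = "pt").to(m.device)
out = m.generate(**inputs, max_new_tokens = 8)
print(tok.decode(out[0], skip_special_tokens = True))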