unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Tied-weight models like Llama 3.2 3B cannot be saved during checkpointing #1105

Open kovern opened 2 weeks ago

kovern commented 2 weeks ago

Hello, I tried to train Llama 3.2 3B. It's a full finetune, not a LoRA, but Unsloth always crashes, under varying conditions, when the model should be saved. The hardware was RunPod in all cases, with different configurations (H100, A100, RTX 6000), and the latest Unsloth and latest Transformers.

Case 1. Autosave every N steps. I set save_strategy = "steps" and save_steps = 1000 in UnslothTrainingArguments. At the 1000th step this happened:

Traceback (most recent call last):
  File "train.py", line 90, in <module>
    trainer_stats = trainer.train()
                    ^^^^^^^^^^^^^^^
  File "", line 142, in train
  File "", line 440, in _fast_inner_training_loop
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2807, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2886, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3454, in save_model
    self._save(output_dir)
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3525, in _save
    self.model.save_pretrained(
  File "/usr/local/lib/python3.11/dist-packages/transformers/modeling_utils.py", line 2793, in save_pretrained
    safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
  File "/usr/local/lib/python3.11/dist-packages/safetensors/torch.py", line 286, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
                   ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/safetensors/torch.py", line 488, in _flatten
    raise RuntimeError(
RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'lm_head.weight', 'model.embed_tokens.weight'}]. A potential way to correctly save your model is to use save_model. More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
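For reference, a minimal sketch of the Case 1 configuration; everything except save_strategy and save_steps is an assumed placeholder, and the import path for UnslothTrainingArguments is an assumption as well:

    # Sketch of the Case 1 setup (assumed values except save_strategy / save_steps).
    from unsloth import UnslothTrainingArguments   # assumed import path

    training_args = UnslothTrainingArguments(
        output_dir='outputs',              # assumed
        per_device_train_batch_size=2,     # assumed
        learning_rate=2e-5,                # assumed
        save_strategy='steps',             # checkpoint every save_steps optimizer steps
        save_steps=1000,                   # the crash happens when this checkpoint is written
    )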

Unfortunately the problem is not the magic number 1000; Unsloth crashes with any other value as well.

Case 2. No autosave at any step. I tried a lot of things, like deleting the save-related config values from UnslothTrainingArguments, or setting save_steps to a number greater than the total length of the training. But at the end of the training Unsloth began to autosave the model anyway, and the same error appeared: the traceback is identical to the one in Case 1, ending in the same RuntimeError about 'lm_head.weight' and 'model.embed_tokens.weight' sharing memory.

Case 3. Save at the end. It seemed the only solution was to disable autosaving completely, which is not a wise decision, but I had no other choice. Lo and behold, the training went through, but at the end I got the same error when I tried to save the model manually:

    model.save_pretrained('new_model')          --> ERROR
    trainer.model.save_pretrained('new_model')  --> ERROR

Case 4. The only way that worked. Finally I tried this, and it worked, but I'm not at all sure it's the right approach:

    model.save_pretrained_merged('new_model', tokenizer, save_method = "merged_16bit")

danielhanchen commented 2 weeks ago

Oh actually I think this is a known issue - I just didn't have time to fix it sorry :( I'll flag this as an issue though

kovern commented 2 weeks ago

> Oh actually I think this is a known issue - I just didn't have time to fix it sorry :( I'll flag this as an issue though

Thank you very much :) Is there a timeline for when this might happen?

And another important topic: can you please confirm that save_pretrained_merged saves the correct model (even though this is not a LoRA training)? If so, I can continue the training; but if there is no way to save the trained model correctly, then continuing would be a complete waste of time and money until the fix lands.

fzyzcjy commented 2 weeks ago

+1, I see the same problem. Thank you for the future fix!

Btw, may I know roughly when it will be fixed? This is blocking my training right now :(

danielhanchen commented 1 day ago

OOHH wait, I get it - I was so confused about what the issue was. Llama 3.2 has tied lm_head and embed_tokens, hence the bug - if you're finetuning lm_head and embed_tokens, you will get this issue.

save_pretrained_merged will get you the correct results as well, but yes, this is an issue for tied models - I might work on something to help alleviate it
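For context, a quick way to see the tying described above is to check that the input and output embeddings of the loaded model share storage. This is only a sketch: it assumes model is the Llama 3.2 3B instance being finetuned and uses standard Transformers/PyTorch accessors:

    # Confirm that lm_head.weight and model.embed_tokens.weight are the same tensor,
    # which is exactly what the safetensors shared-memory check rejects.
    emb_weight  = model.get_input_embeddings().weight     # model.embed_tokens.weight
    head_weight = model.get_output_embeddings().weight    # lm_head.weight

    print(model.config.tie_word_embeddings)                  # True for Llama 3.2 3B
    print(emb_weight.data_ptr() == head_weight.data_ptr())   # True -> shared storage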

fzyzcjy commented 1 day ago

Thank you! I guess weight tying is the problem.

kovern commented 1 day ago

Thanks! Any workaround for how to automatically save the model during training?

fzyzcjy commented 1 day ago

@kovern I use safe_serialization=False, but I don't know whether this will introduce other bugs. In other words, I'm not sure whether it avoids the exception because it really understands weight tying, or simply because it never checks for it.
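For the manual save in Case 3, that workaround would look roughly like this (the output path is just an example); skipping safetensors makes save_pretrained fall back to torch-based serialization, which tolerates shared tensors:

    # Manual save without safetensors, so the tied lm_head / embed_tokens
    # tensors don't trigger the shared-memory RuntimeError.
    model.save_pretrained('new_model', safe_serialization=False)
    tokenizer.save_pretrained('new_model')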

kovern commented 1 day ago

@fzyzcjy in TrainingArguments there is no safe_serialization, but now that I've had a closer look I've actually found something interesting: save_safetensors.

Tomorrow I'll try that; I hope it works well with save_strategy = "steps".

fzyzcjy commented 1 day ago

Oh yes, I mean save_safetensors=False for TrainingArguments (safe_serialization is for other functions iirc)

I do have

            save_strategy='steps',
            save_safetensors=False,

and it works well
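For completeness, a minimal sketch of how those two settings fit into a plain TrainingArguments config; every other argument and value here is an assumption, not something from this thread:

    # Checkpoint every 1000 steps, but write PyTorch .bin shards instead of
    # safetensors so the tied-weight check is never hit.
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir='outputs',            # assumed
        per_device_train_batch_size=2,   # assumed
        max_steps=3000,                  # assumed
        save_strategy='steps',
        save_steps=1000,                 # assumed interval
        save_safetensors=False,          # the workaround discussed above
    )

Checkpoints written this way contain pytorch_model.bin shards; resuming with trainer.train(resume_from_checkpoint=True) should still work as usual.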