brunopistone opened this issue 1 week ago
I experienced a similar problem when training with FSDP + Trainer; when DDP is used instead of FSDP, the problem disappears. I fixed it by using the PyTorch nightly build as of Oct 16, 2024. My (quite uneducated) guess is that it is related to this issue; however, simply disabling weight tying did not work for me, while upgrading the PyTorch version did.
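For reference, a minimal sketch of what disabling weight tying looks like via the transformers config; the model id is assumed, and as noted above this change alone did not resolve the problem for me:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Sketch: untie lm_head from the input embeddings via the config.
# Model id assumed; this alone did not fix the FSDP save issue for me.
config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-1B")
config.tie_word_embeddings = False

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    config=config,
)
```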
System Info
SageMaker Training Job instance: ml.g5.12xlarge
Image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.4-gpu-py311
Who can help?
@ArthurZucker @muellerzr @SunMarc
Information
Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
I'm running a full fine-tuning of Llama 3.2 1B with Amazon SageMaker. This is the script:
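(The script body is missing from this copy of the issue; the following is a minimal sketch of an equivalent Trainer + FSDP setup, with the model id, hyperparameters, and dataset preparation all assumed rather than taken from the original.)

```python
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "meta-llama/Llama-3.2-1B"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Stand-in for the dataset described below; the real records were not preserved.
raw = Dataset.from_dict({"text": ["mock instruction and response text"]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_dataset = raw.map(tokenize, batched=True, remove_columns=["text"])

training_args = TrainingArguments(
    output_dir="/opt/ml/model",   # SageMaker model output path (assumed)
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    bf16=True,
    fsdp="full_shard auto_wrap",  # enable FSDP through the Trainer
    fsdp_config={"transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"]},
    logging_steps=10,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("/opt/ml/model")
```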
train_dataset:
train_dataset[0]["text"] (mock):
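(The sample text is also missing here; purely as a hypothetical stand-in, a record formatted with the Llama 3 chat template might look like this. The content below is invented for illustration.)

```python
# Hypothetical stand-in; the real sample was not preserved in this copy.
# Assumes the Llama 3 chat template was applied during preprocessing.
mock_text = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "What is Amazon SageMaker?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "Amazon SageMaker is a managed machine learning service.<|eot_id|>"
)
```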
When I try to load the model with the following script:
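(The loading script is likewise missing; a minimal sketch of the standard loading path, with the checkpoint location assumed.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed location of the artifacts written by the training job.
checkpoint = "/opt/ml/model"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```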
I get the following exception:
Expected behavior
The previously shared script works properly if I'm fine-tuning with `bfloat16` mixed precision, with quantization using bitsandbytes, and with LoRA. I suspect there is something wrong in how the model is saved. The expected behavior is that the model is properly loaded and usable for inference.
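For contrast, here is a sketch of the kind of configuration that does work after training: 4-bit quantization with bitsandbytes plus LoRA via peft. The specific hyperparameter values and target modules are assumptions, not taken from the original run:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-1B"  # assumed model id

# 4-bit quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# LoRA adapters on the attention projections (target modules assumed)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```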