huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Apache License 2.0

Using accelerate launch with FSDP causes weights saved from the 2nd time onwards to be incomplete #31034

Open aliencaocao opened 1 month ago

aliencaocao commented 1 month ago

System Info

Who can help?

@pacman100 @muellerz




1. Train a model with FSDP as configured with accelerate config.
2. The first save produces correct weights. It doesn't matter when it happens (mid-epoch, first step, end of 100 epochs, etc.), as long as it is the first save.
3. From the 2nd save onwards, the saved weights are ~100MB smaller: all the keys are present, BUT some have no weight in them and others have the wrong shape.

This causes an error when loading:

[Image: WhatsApp image 2024-05-26 at 14 29 13_ca42656f]

I have tested both SigLIP and OWLv2 so far, and both have the same issue. Other models may be affected too. It happens with both safetensors and pytorch.bin. pytorch_model_fsdp.bin is also missing the weights. I have set the state dict type to FULL.
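One way to pin down which keys are affected is to compare the parameter shapes from the first (good) save against a later (bad) one. The following is only a minimal sketch: the function name `diff_checkpoint_shapes` and the example shapes are hypothetical, not taken from the actual checkpoints, and it works on plain name-to-shape mappings so it runs without any framework installed.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def diff_checkpoint_shapes(
    reference: Dict[str, Shape], suspect: Dict[str, Shape]
) -> Dict[str, List[str]]:
    """Compare parameter shapes from two saves of the same model.

    Returns the keys that are missing, empty (a zero-sized dimension),
    or differently shaped in `suspect` relative to `reference`.
    """
    report: Dict[str, List[str]] = {"missing": [], "empty": [], "mismatched": []}
    for name, ref_shape in sorted(reference.items()):
        if name not in suspect:
            report["missing"].append(name)
            continue
        shape = suspect[name]
        if 0 in shape and 0 not in ref_shape:
            # Key exists but holds no data.
            report["empty"].append(name)
        elif shape != ref_shape:
            # Key exists with data, but the shape differs.
            report["mismatched"].append(name)
    return report


# Hypothetical shapes standing in for a first (good) and second (bad) save.
first_save = {"embed.weight": (1000, 64), "head.weight": (10, 64), "head.bias": (10,)}
second_save = {"embed.weight": (0,), "head.weight": (640,), "head.bias": (10,)}

report = diff_checkpoint_shapes(first_save, second_save)
print(report)
```

With PyTorch installed, the mappings could be built from real files with something like `{k: tuple(v.shape) for k, v in torch.load(path, map_location="cpu").items()}`, assuming each file holds a flat state dict. Attaching the resulting report to the issue would show exactly which keys differ between saves.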

Expected behavior

No issue saving

amyeroberts commented 4 days ago

cc @muellerzr @SunMarc