huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Using accelerate launch with FSDP causes weights saved from the 2nd time onwards to be incomplete #31034

Open aliencaocao opened 1 month ago

aliencaocao commented 1 month ago

System Info

Who can help?

@pacman100 @muellerzr

Information

Tasks

Reproduction

1. Train a model with FSDP as configured via accelerate config.
2. The first save produces correct weights. It doesn't matter when that save happens (first step, mid-epoch, end of 100 epochs, etc.), as long as it is the first save.
3. From the 2nd save onwards, the saved weights are ~100MB smaller. All the keys are present, BUT some of them hold no weight data and others have the wrong shape.

This causes an error when loading (screenshot attached: WhatsApp image 2024-05-26 at 14 29 13_ca42656f).

So far I have tested both SigLIP and OWLv2, and both have the same issue. Other models may be affected as well. It happens with both safetensors and PyTorch .bin checkpoints, and pytorch_model_fsdp.bin is missing the same weights. I have set the FSDP state dict type to FULL.
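To pin down which keys are affected, the first (good) save can be compared with a later save. The following is only a minimal sketch: the function name `diff_checkpoint_shapes` and the example keys and shapes are hypothetical and are not taken from the failing run. It assumes each checkpoint has already been reduced to a mapping from parameter name to shape, for example with `{k: tuple(v.shape) for k, v in torch.load(path, map_location="cpu").items()}` for a .bin file.

```python
from math import prod
from typing import Dict, List, Mapping, Tuple

Shape = Tuple[int, ...]


def diff_checkpoint_shapes(
    first: Mapping[str, Shape], second: Mapping[str, Shape]
) -> Dict[str, List[str]]:
    """Compare a known-good save against a later save, key by key.

    Hypothetical helper: both arguments map parameter name -> shape.
    """
    report: Dict[str, List[str]] = {"missing": [], "empty": [], "shape_mismatch": []}
    for name, shape in first.items():
        if name not in second:
            # Key was dropped entirely in the later save.
            report["missing"].append(name)
            continue
        other = tuple(second[name])
        if prod(other) == 0 and prod(shape) != 0:
            # Key exists but carries no elements (e.g. shape (0,)).
            report["empty"].append(name)
        elif other != tuple(shape):
            # Key exists with data, but the shape differs.
            report["shape_mismatch"].append(name)
    return report


# Made-up shapes for illustration only, not from the real checkpoints.
first_save = {"embed.weight": (8, 4), "proj.weight": (4, 4), "proj.bias": (4,)}
second_save = {"embed.weight": (0,), "proj.weight": (16,), "proj.bias": (4,)}

report = diff_checkpoint_shapes(first_save, second_save)
print(report)
```

Running this against the 1st and 2nd saves should list exactly the keys that end up with no weight or with the wrong shape, which may help narrow down whether only FSDP-wrapped parameters are affected.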

Expected behavior

Weights are saved completely and correctly on every save, not only the first.

amyeroberts commented 4 days ago

cc @muellerzr @SunMarc