Using distributed or parallel set-up in script?: FSDP on 5 GPUs in 1 node
Who can help?
@pacman100 @muellerz
Information
[ ] The official example scripts
[X] My own modified scripts
Tasks
[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)
Reproduction
Train a model with FSDP as configured via `accelerate config`.
The first time the weights are saved, everything is OK (it doesn't matter when the save happens, mid-epoch, first step, end of 100 epochs, etc., as long as it's the first time).
From the second save onwards, the weights are magically ~100MB smaller, with all the keys present BUT no weights in some of them and the wrong shape in others. This causes an error when loading:
![WhatsApp image 2024-05-26 at 14 29 13_ca42656f](https://github.com/huggingface/transformers/assets/20109683/185601df-6cf8-43f3-953b-9c4abd8879e2)
I have so far tested both SigLIP and OWLv2; both have the same issue, and other models may be affected as well. It happens with both safetensors and pytorch_model.bin, and pytorch_model_fsdp.bin is also missing them; a small checkpoint-inspection sketch follows below.
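For reference, a minimal sketch of how the broken checkpoints could be inspected: it compares a good safetensors checkpoint against a later one and prints the keys whose tensors are empty or have a different shape. The paths are placeholders, not my actual output directory.

```python
# Sketch: diff two saved safetensors checkpoints and report keys whose tensors
# are empty or mis-shaped. The checkpoint paths below are placeholders.
from safetensors import safe_open

def diff_checkpoints(good_path: str, bad_path: str) -> None:
    with safe_open(good_path, framework="pt") as good, safe_open(bad_path, framework="pt") as bad:
        good_keys, bad_keys = set(good.keys()), set(bad.keys())
        print("keys missing from the later checkpoint:", good_keys - bad_keys)
        for key in sorted(good_keys & bad_keys):
            g, b = good.get_tensor(key), bad.get_tensor(key)
            if tuple(g.shape) != tuple(b.shape):
                print(f"{key}: shape {tuple(g.shape)} -> {tuple(b.shape)}")
            elif b.numel() == 0:
                print(f"{key}: empty tensor in the later checkpoint")

diff_checkpoints(
    "output/checkpoint-100/model.safetensors",  # first save, loads fine
    "output/checkpoint-200/model.safetensors",  # later save, ~100MB smaller
)
```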
I have set the FSDP state dict type to FULL_STATE_DICT.
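For context, this is the full-state-dict save pattern documented for accelerate's FSDP integration, which is what I would expect each checkpoint save to do (a sketch only, not the Trainer's exact code; `model` and `accelerator` are assumed to come from an accelerate-prepared FSDP training loop, and the output path is a placeholder):

```python
# Sketch of the documented accelerate FSDP full-state-dict save pattern.
# Assumes `model` and `accelerator` come from an accelerate-prepared FSDP run.
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "output/checkpoint-manual",                    # placeholder output directory
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),  # gathers the full state dict
)
```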
System Info
transformers version: 4.41.1
Expected behavior
Weights should save correctly every time, not only on the first save.