[Closed] Mukhammadsaid19 closed this issue 1 year ago
Can you share your training script/yaml?
Closing due to inactivity.
If I may add to this:

I have the same problem after pretraining the HF version of BERT on all 8 GPUs of a DGX A100 node. The problem is that the following dictionary in the saved checkpoint is empty:

`state_dict['state']['integrations']`

Strangely, if I only use 1 GPU of the DGX A100 node, everything is fine. My workaround at the moment is simply to load the `state_dict['state']['integrations']` of the 1-GPU run when I save a HF model from the 8-GPU run checkpoint, but that is obviously not ideal.
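For anyone hitting the same issue, here is a minimal sketch of that workaround. The checkpoint file names are placeholders; the only assumption beyond what is described above is that both files are ordinary Composer checkpoints loadable with `torch.load`:

```python
import torch

# Load the 8-GPU checkpoint whose 'integrations' dict is empty,
# and a 1-GPU checkpoint from an equivalent run where it is populated.
# (File names are placeholders.)
multi_gpu_ckpt = torch.load('ep1-ba100-rank0-8gpu.pt', map_location='cpu')
single_gpu_ckpt = torch.load('ep1-ba100-rank0-1gpu.pt', map_location='cpu')

# Copy over the HuggingFace metadata that the 8-GPU run failed to save.
multi_gpu_ckpt['state']['integrations'] = single_gpu_ckpt['state']['integrations']

# Write the patched checkpoint; the HF export utility can then read it.
torch.save(multi_gpu_ckpt, 'ep1-ba100-rank0-8gpu-patched.pt')
```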
I have attached my training script and YAML; maybe they help. Note that I had to upload them as ".txt" files, since ".sbatch" and ".yml" files are not supported.
I think that is a bug in an older version of Composer. It should be fixed as of Composer version 0.15.
I use version 0.15.1.
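For reference, an easy way to confirm which Composer version is actually active in the environment:

```python
import composer

# Prints the installed Composer version, e.g. '0.15.1'.
print(composer.__version__)
```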
After the training was done, I tried to use the utility function `write_huggingface_pretrained_from_composer_checkpoint` to save a HF model from a checkpoint. However, it throws a KeyError saying it can't find the `huggingface` field inside the binary model file. It seems the saved binaries have their `integrations` field empty. I didn't change any configs inside the Trainer. What might be going wrong?
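A quick way to check whether a given checkpoint is affected is to inspect that field directly. This is a minimal sketch, assuming a standard Composer checkpoint file (the path is a placeholder):

```python
import torch

# Load the Composer checkpoint on CPU (path is a placeholder).
state_dict = torch.load('ep1-ba100-rank0.pt', map_location='cpu')

# An intact checkpoint carries HF metadata under 'integrations' -> 'huggingface';
# the affected checkpoints described in this thread have an empty dict instead.
integrations = state_dict['state']['integrations']
print(integrations.keys())  # expect dict_keys(['huggingface']) when intact
```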