[Closed] Mukhammadsaid19 closed this issue 1 year ago
Can you share your training script/yaml?
Closing due to inactivity.
If I may add to this:

I have the same problem after pretraining the HF version of BERT on all 8 GPUs of a DGX A100 node. The problem is that the following dictionary in the saved checkpoint is empty:

`state_dict['state']['integrations']`

Strangely, if I only use 1 GPU of the DGX A100 node, everything is fine. My workaround at the moment is simply to load the `state_dict['state']['integrations']` of the 1-GPU run when I save a HF model from the 8-GPU run checkpoint, but that is obviously not ideal.
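For anyone hitting the same issue, here is a minimal sketch of that workaround. The checkpoint file names are placeholders; the only assumption beyond what is described above is that both files are ordinary Composer checkpoints loadable with `torch.load`:

```python
import torch

# Load the 8-GPU checkpoint whose 'integrations' dict is empty,
# and a 1-GPU checkpoint from an equivalent run where it is populated.
# (File names are placeholders.)
multi_gpu_ckpt = torch.load('ep1-ba100-rank0-8gpu.pt', map_location='cpu')
single_gpu_ckpt = torch.load('ep1-ba100-rank0-1gpu.pt', map_location='cpu')

# Copy over the HuggingFace metadata that the 8-GPU run failed to save.
multi_gpu_ckpt['state']['integrations'] = single_gpu_ckpt['state']['integrations']

# Write the patched checkpoint; the HF export utility can then read it.
torch.save(multi_gpu_ckpt, 'ep1-ba100-rank0-8gpu-patched.pt')
```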
I have attached my training script and YAML; maybe they help. Note that I had to upload them as ".txt" files, since ".sbatch" and ".yml" files are not supported.
I think that is a bug in an older version of Composer. It should be fixed as of Composer version 0.15.
I use version 0.15.1.
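For reference, an easy way to confirm which Composer version is actually active in the environment:

```python
import composer

# Prints the installed Composer version, e.g. '0.15.1'.
print(composer.__version__)
```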
After the training was done, I tried to use the utility function `write_huggingface_pretrained_from_composer_checkpoint` to save a HF model from a checkpoint. However, it throws a KeyError saying it can't find the `huggingface` field inside the binary model file. It seems the saved binaries have their `integrations` field empty. I didn't change any configs inside the Trainer. What might be going wrong?
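A quick way to check whether a given checkpoint is affected is to inspect that field directly. This is a minimal sketch, assuming a standard Composer checkpoint file (the path is a placeholder):

```python
import torch

# Load the Composer checkpoint on CPU (path is a placeholder).
state_dict = torch.load('ep1-ba100-rank0.pt', map_location='cpu')

# An intact checkpoint carries HF metadata under 'integrations' -> 'huggingface';
# the affected checkpoints described in this thread have an empty dict instead.
integrations = state_dict['state']['integrations']
print(integrations.keys())  # expect dict_keys(['huggingface']) when intact
```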