microsoft / DeBERTa

The implementation of DeBERTa
MIT License

Why does the size of DeBERTaV3 double on disk after finetuning? #106

Closed nadahlberg closed 1 year ago

nadahlberg commented 2 years ago

On HF, deberta-v3-large is 800 MB: https://huggingface.co/microsoft/deberta-v3-large

But after even a few steps of MLM training, the saved model is 1.6 GB: https://colab.research.google.com/drive/1PG4PKYnye_F1We2i7VccQ4nYn_XTHhKP?usp=sharing

This seems true of many other finetuned versions of DeBERTaV3 on HF (for both base and large size). It also doesn't seem specific to MLM: https://huggingface.co/navteca/nli-deberta-v3-large https://huggingface.co/cross-encoder/nli-deberta-v3-base/tree/main

Any idea why this is -- is it something to do with V3 itself? And does anyone know if the model size can be reduced again after training?

Thanks!

darraghdog commented 1 year ago

I guess it is because you are saving the gradients of the model as well as its weights. To checkpoint training steps so they can be resumed later, gradients are needed, but they would not be present in the HF pretrained models.
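For reference, a quick way to check what a saved checkpoint actually contains is to load it with torch and inspect the entries. This is just a sketch; the checkpoint path is hypothetical, use whatever your training run wrote out:

import torch

# load the finetuned checkpoint (hypothetical path)
state = torch.load("output/pytorch_model.bin", map_location="cpu")

# a plain weights file is just a dict of parameter tensors; optimizer state or
# gradients would show up as extra non-parameter entries (or in a separate file)
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)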

nadahlberg commented 1 year ago

@darraghdog thanks for the suggestion! I checked it out and it doesn't seem like these finetuned models include gradients. After poking around a bit, I think the issue is that the original checkpoint was saved in half precision (it appears to have been trained with mixed precision), whereas from_pretrained defaults to loading the model with float32 tensors, so re-saving after finetuning writes the weights at twice the size.
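One way to sanity-check this (a minimal sketch; the local checkpoint paths are assumptions) is to compare the tensor dtypes and total byte size of the original download against the finetuned save:

import torch

def checkpoint_gb_and_dtypes(path):
    # sum the raw tensor sizes and collect the dtypes present in a saved state dict
    state = torch.load(path, map_location="cpu")
    tensors = [t for t in state.values() if torch.is_tensor(t)]
    gb = sum(t.numel() * t.element_size() for t in tensors) / 1e9
    return gb, {str(t.dtype) for t in tensors}

# hypothetical local copies of the two checkpoints
print(checkpoint_gb_and_dtypes("deberta-v3-large/pytorch_model.bin"))  # expect ~0.8 GB, float16
print(checkpoint_gb_and_dtypes("finetuned/pytorch_model.bin"))         # expect ~1.6 GB, float32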

Will mark this as closed, but for anyone else who comes across this and wants to preserve the original model size after finetuning, you can load the model like this:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained('microsoft/deberta-v3-base', torch_dtype=torch.float16)
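If you have already finetuned and saved in float32, another option (a sketch, assuming the hypothetical paths below and that half-precision weights are acceptable for your use case) is to cast the model back down before re-saving:

model = AutoModel.from_pretrained('path/to/finetuned-model')
model.half()  # cast all floating-point parameters to float16
model.save_pretrained('path/to/finetuned-model-fp16')

This halves the file on disk; just note that any downstream code expecting float32 weights may need to cast the model back up when loading.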