Closed nadahlberg closed 1 year ago
I guess it is because you are saving the gradients of the model as well as the weights. To cache training steps for resuming later, gradients are needed, but they would not be present in the HF pretrained models.
@darraghdog thanks for the suggestion! I checked it out and it doesn't seem like these finetuned models include gradients. After poking around a bit, I think the issue is that the original was trained with mixed precision, whereas `from_pretrained` defaults to loading the model with float32 tensors.
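For anyone wondering why the dtype alone doubles the file size: float32 uses 4 bytes per weight versus 2 for float16, so the same parameters take twice the disk space. A toy illustration (using NumPy arrays just to show the byte math, not the actual checkpoint format):

```python
import numpy as np

# The same weights stored as float32 take twice the bytes of float16,
# which matches the ~800MB -> ~1.6GB jump seen here.
w16 = np.zeros(1_000_000, dtype=np.float16)
w32 = w16.astype(np.float32)
ratio = w32.nbytes // w16.nbytes
print(ratio)  # 2
```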
Will mark this as closed, but for anyone else who comes across this and wants to preserve the original model size after finetuning, you can load the model in half precision:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained('microsoft/deberta-v3-base', torch_dtype=torch.float16)
```
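And to the question of shrinking a model that was already finetuned and saved in float32: casting the weights back down after training should roughly halve the checkpoint. A minimal sketch with a plain PyTorch state dict (for a transformers model the equivalent would be `model.half()` followed by `save_pretrained`; the filename here is illustrative):

```python
import torch

# A stand-in for an fp32 checkpoint's state dict
state = {'weight': torch.randn(1000, 1000)}           # float32 by default

# Cast every tensor to float16 before saving; disk size roughly halves
state_fp16 = {k: v.half() for k, v in state.items()}
# torch.save(state_fp16, 'model_fp16.bin')  # illustrative path

print(state_fp16['weight'].dtype)  # torch.float16
```

Note that casting to float16 is lossy, so it's worth re-checking downstream metrics after the conversion.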
On HF, deberta-v3-large is 800MB: https://huggingface.co/microsoft/deberta-v3-large
But after even a few steps of MLM training, the saved model is 1.6GB: https://colab.research.google.com/drive/1PG4PKYnye_F1We2i7VccQ4nYn_XTHhKP?usp=sharing
This seems true of many other finetuned versions of DeBERTaV3 on HF (for both the base and large sizes). It also doesn't seem specific to MLM: https://huggingface.co/navteca/nli-deberta-v3-large https://huggingface.co/cross-encoder/nli-deberta-v3-base/tree/main
Any idea why this is? Is it something to do with V3 itself? And does anyone know if the model size can be reduced again after training?
Thanks!