chenmoneygithub opened this issue 1 year ago
I've noticed I also see way more parameters than the advertised amount from https://github.com/microsoft/DeBERTa.
E.g., according to the GitHub repo, the base variant should have 86M parameters, but I see 183M when I print them out. This appears to be true on Hugging Face too, so it may be unrelated to the issues Chen is pointing out.
@abheesht17 cc on this one too. Let us know if you have any thoughts!
Hey, @mattdangerw, @chenmoneygithub! I am not sure why this is happening, but I may have some answers.
>>> import numpy as np
>>> from transformers import AutoModel
>>> hf_model = AutoModel.from_pretrained(download_var_name)
>>> # Count only the trainable parameters.
>>> model_parameters = filter(lambda p: p.requires_grad, hf_model.parameters())
>>> params = sum(np.prod(p.size()) for p in model_parameters)
>>> params
70682112
This is the same number as that returned by model.summary() on our model:
...
==================================================================================================
Total params: 70,682,112
Trainable params: 70,682,112
Non-trainable params: 0
__________________________________________________________________________________________________
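The same count can be reproduced framework-agnostically. A minimal sketch (the helper name is mine, not a KerasNLP or transformers API):

```python
import numpy as np

def count_params(weights):
    """Sum the element counts over a list of weight arrays/tensors."""
    return int(sum(np.prod(w.shape) for w in weights))

# Works on anything exposing .shape, e.g. Keras weights via
# model.get_weights(), or torch parameters via p.detach().numpy().
```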
Now, I was a bit curious and did some calculations. The numbers in brackets are the advertised numbers.
# xsmall
total_params - token_emb_params = 70,682,112 - 49,190,400 = 21,491,712 (22M)
# small
total_params - token_emb_params = 141,304,320 - 98,380,800 = 42,923,520 (44M)
# base
total_params - token_emb_params = 183,831,552 - 98,380,800 = 85,450,752 (86M)
The advertised numbers on the repo omit the token embedding parameters. So, I don't think this should be an issue.
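The subtraction can be checked directly against the configs: DeBERTa v3 uses a 128,100-token vocabulary, with hidden size 384 for xsmall and 768 for small/base (the dims here are my reading of the published configs, so treat them as an assumption):

```python
# Recompute the "advertised" counts as total minus token-embedding params.
VOCAB_SIZE = 128_100  # DeBERTa v3 vocabulary size

models = {
    # name: (total_params, hidden_dim)
    "xsmall": (70_682_112, 384),
    "small": (141_304_320, 768),
    "base": (183_831_552, 768),
}

for name, (total, hidden) in models.items():
    token_emb = VOCAB_SIZE * hidden  # token-embedding parameter count
    print(f"{name}: {total:,} - {token_emb:,} = {total - token_emb:,}")
```

This reproduces the three differences above (21,491,712 / 42,923,520 / 85,450,752), matching the ~22M/44M/86M advertised numbers.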
Our GPU performance seems to be fine, check this colab for comparison: https://colab.research.google.com/gist/chenmoneygithub/ca38f7132fc17c85511e612d09ed686c/deberta-checks.ipynb
DeBERTa runs very slowly on both TPU and GPU:
- batch_size=32
- batch_size=16
Comparatively, for BERT small:
- batch_size=32
- batch_size=16
GPU might be fine given the difference in model size, but TPU is not behaving normally. My suspicion is that something is not compatible with XLA.
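When comparing per-step times across devices, it helps to exclude one-time XLA tracing/compilation from the measurement. A minimal timing sketch (benchmark_step is a hypothetical helper, not something from the linked colab):

```python
import time

def benchmark_step(step_fn, warmup=3, iters=10):
    """Average wall-clock time per call of step_fn.

    The warmup iterations absorb one-time costs such as XLA
    tracing/compilation, so only steady-state step time is measured.
    """
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    return (time.perf_counter() - start) / iters
```

On TPU, the first call to a compiled train step triggers XLA compilation, so warmup matters; if the per-step time stays high even after warmup, the slowdown is in the compiled computation itself rather than compilation overhead.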