microsoft / DeBERTa

The implementation of DeBERTa
MIT License

mDeBERTa on HuggingFace hub does not seem to work #77

Closed MoritzLaurer closed 2 years ago

MoritzLaurer commented 2 years ago

I really like the DeBERTa-v3 models and the monolingual models work very well for me. Weirdly enough, the multilingual model uploaded to the Hugging Face hub does not seem to work. I have code for training multilingual models on XNLI, and the training normally works well (e.g. no issue with microsoft/Multilingual-MiniLM-L12-H384), but when I apply the exact same code to mDeBERTa, the model does not seem to learn anything. I don't get an error message, but the training results look like this:

[Screenshot of the training log: accuracy at exactly 0.3333, training loss 0, NaN validation loss]

I've double-checked by running the exact same code on multilingual-minilm, and the training works, which makes me think it's not an issue in the code (wrongly formatted input data or something like that) but that something went wrong when uploading mDeBERTa to the Hugging Face hub. Accuracy at exactly the random baseline of 0.3333, a training loss of 0 at epoch 2, and a NaN validation loss perhaps indicate that the data is running through the model but some parameters are not updating.

My environment is Google Colab; transformers==4.12.5.
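
A minimal sketch of the kind of setup described above (assuming the Hugging Face Trainer is used; the exact training script isn't shown in this thread, so the data handling below is illustrative):

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "microsoft/mdeberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)  # XNLI has 3 labels

dataset = load_dataset("xnli", "en")  # premise/hypothesis pairs

def tokenize(batch):
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="mdeberta-xnli",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",
    fp16=True,  # mixed precision: the setting that, per the rest of the thread, triggers the degenerate training
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()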

BigBird01 commented 2 years ago

For mDeBERTa, you need to use fp32. There is a fix in our official repo, and we are going to port it to transformers soon.
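
In practice that means leaving mixed precision off for now, e.g. with the Hugging Face Trainer (a sketch; assumes that's the training loop in use):

from transformers import TrainingArguments

# Interim workaround: keep mDeBERTa in full fp32 by not enabling mixed precision.
training_args = TrainingArguments(
    output_dir="mdeberta-xnli",
    fp16=False,  # the default; just make sure fp16/amp is not switched on
)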

MoritzLaurer commented 2 years ago

Cool, does this mean that after the fix I can use fp16 as well?

MoritzLaurer commented 2 years ago

Is there an update on this? I don't think an update was pushed to the Hugging Face hub: https://huggingface.co/microsoft/mdeberta-v3-base/commits/main

It would be great to be able to use it with fp16.

jtomek commented 2 years ago

Have you figured it out, guys?

abdullahmuaad9 commented 2 years ago

ValueError: Tokenizer class DebertaV2Tokenizer does not exist or is not currently imported. Any ideas? Please share, thanks in advance.
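
(Side note, a guess rather than a diagnosis confirmed in this thread: that ValueError usually shows up when the installed transformers release predates DeBERTa-v3 support or when the sentencepiece dependency that DebertaV2Tokenizer relies on is missing. A sketch of the usual remedy:)

# pip install -U transformers sentencepiece
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
print(tokenizer.tokenize("mDeBERTa tokenizer sanity check"))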

barschiiii commented 2 years ago

@BigBird01 do you have any update on this by chance?

jaideep11061982 commented 1 year ago

@jtomek @abdullahmuaad9 @BigBird01 @MoritzLaurer @barschiiii any fix for this? I get NaN with mDeBERTa.

rfbr commented 1 year ago

Hello team! Is there any update on this? @jtomek @abdullahmuaad9 @BigBird01 @MoritzLaurer @barschiiii @jaideep11061982 Thanks!

abdullahmuaad9 commented 1 year ago

Yes, can you tell me which kind of update? Thank you.

rfbr commented 1 year ago

I pinged you just in case you were interested in the eventual answer from the Microsoft team on the possibility of using fp16 with mDeBERTa.

rfbr commented 1 year ago

Hello there! I have tracked the different modules to find where the under/overflows happen. The DisentangledSelfAttention module is the culprit; replacing it with the implementation in this repo fixed the issue (I haven't spent the time to find the specific operation causing the NaN).
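
One way to do that kind of module-by-module tracking (a sketch of the general approach, not necessarily the exact procedure used here) is to register forward hooks and report any module whose output stops being finite:

import torch

def add_nan_hooks(model):
    # Attach a forward hook to every submodule; print the name of any module
    # whose output contains NaN/Inf during a forward pass.
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"non-finite output in {name} ({type(module).__name__})")
        return hook

    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]

# Usage: handles = add_nan_hooks(model); run one fp16 training step and note
# which module is reported first; then call h.remove() on each handle.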

sjrl commented 1 year ago

Hey @rfbr I tried updating the DisentangledSelfAttention module in HF transformers with the one in this repo, but when fine-tuning on extractive QA (SQuAD 2.0) with fp16 I was still getting NaN predictions. Do you have an example implementation in the transformers code I could look at?

Update: Actually it seems I got it to work. It appears the key was calculating the scale like this (using the math library) https://github.com/microsoft/DeBERTa/blob/4d7fe0bd4fb3c7d4f4005a7cafabde9800372098/DeBERTa/deberta/disentangled_attention.py#L85-L86 instead of what's implemented in transformers https://github.com/huggingface/transformers/blob/ef42c2c487260c2a0111fa9d17f2507d84ddedea/src/transformers/models/deberta_v2/modeling_deberta_v2.py#L724-L725

scale = torch.sqrt(torch.tensor(query_layer.size(-1), dtype=torch.float) * scale_factor)
attention_scores = torch.bmm(query_layer, key_layer.transpose(-1, -2)) / scale.to(dtype=query_layer.dtype)

which uses only torch operations.

Could this be because we aren't calling something like detach in the transformers code? Or maybe it has to do with the order of operations (e.g. performing the division before the multiplication, as is done in this repo)?
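
For reference, a side-by-side sketch of the two scalings discussed above (the tensor shapes and the divide-before-matmul ordering are my reading of the linked lines, so treat them as assumptions):

import math
import torch

def scores_transformers_style(query_layer, key_layer, scale_factor):
    # scale built as a float32 tensor, then cast back to the (possibly fp16) activation dtype
    scale = torch.sqrt(torch.tensor(query_layer.size(-1), dtype=torch.float) * scale_factor)
    return torch.bmm(query_layer, key_layer.transpose(-1, -2)) / scale.to(dtype=query_layer.dtype)

def scores_math_style(query_layer, key_layer, scale_factor):
    # scale kept as a plain Python float and applied to the query before the matmul
    scale = math.sqrt(query_layer.size(-1) * scale_factor)
    return torch.bmm(query_layer / scale, key_layer.transpose(-1, -2))

# Mathematically the two are the same scaling; in fp32 they agree to rounding
# error, and any divergence only shows up in half precision.
q = torch.randn(8, 128, 64)  # [batch*heads, seq_len, head_dim]
k = torch.randn(8, 128, 64)
print(torch.allclose(scores_transformers_style(q, k, 3), scores_math_style(q, k, 3), atol=1e-5))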

jplu commented 1 year ago

Hey! Is there any update on this @BigBird01? I'm using the latest version of transformers (4.29.2) and I'm still facing the same issue when using fp16. When will you port the fix?

Thanks.

sjrl commented 1 year ago

Hey @jplu I think I was able to port the changes into my forked branch of transformers here. If you'd just like to see the git diff so you can try the same, take a look here. I did this by comparing the implementation in this repo with the one in transformers.

Doing this I was able to get fp16 training working in transformers.

jplu commented 1 year ago

Hey @sjrl! Thanks a lot for sharing this. Indeed, I can confirm that with your code I'm able to train with fp16. Did you open a PR on the main repo? If not, it would be nice to have this fix integrated.

sjrl commented 1 year ago

@jplu Just opened the PR! I took some time to find the minimal changes needed to get fp16 training to work. Hopefully that will speed up the review process.

jplu commented 1 year ago

Awesome this seems perfect! Thanks a lot!

jtomek commented 1 year ago

This is honestly perfect, @sjrl. What a clever way to solve the problem! 🤩