MoritzLaurer closed this issue 2 years ago.
For mDeBERTa, you need to use fp32. There is a fix in our official repo and we are going to port the fix to transformers soon.
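Until the fix is ported, a minimal workaround sketch (assuming training via the Hugging Face `Trainer`; argument names as in recent transformers releases) is simply to leave mixed precision disabled so the model runs in fp32:

```python
from transformers import TrainingArguments

# Workaround sketch: keep mDeBERTa in full fp32 by not enabling
# mixed precision (fp16 currently produces NaNs for this model).
args = TrainingArguments(
    output_dir="out",
    fp16=False,  # the default, stated explicitly here for emphasis
)
```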
Cool, this means that after the fix I can use fp16 as well?
Is there an update on this? I don't think an update was pushed to the Hugging Face Hub: https://huggingface.co/microsoft/mdeberta-v3-base/commits/main
Would be great to be able to use it with FP16
Have you figured it out, guys?
`ValueError: Tokenizer class DebertaV2Tokenizer does not exist or is not currently imported.` Any ideas? Please share, thanks in advance.
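Regarding the tokenizer error above, two common causes (my assumptions, not confirmed in this thread) are a transformers release that predates the `DebertaV2Tokenizer` class and a missing `sentencepiece` package, which the slow DeBERTa-v2 tokenizer depends on; upgrading both usually resolves the import:

```shell
# Hypothetical fix: make sure both packages are installed and recent enough.
pip install --upgrade transformers sentencepiece
```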
@BigBird01 do you have any update on this by chance?
@jtomek @abdullahmuaad9 @BigBird01 @MoritzLaurer @barschiiii Any fix for this? I get NaNs with mDeBERTa.
Hello team! Is there any update on this? @jtomek @abdullahmuaad9 @BigBird01 @MoritzLaurer @barschiiii @jaideep11061982 Thanks!
Yes. Can you tell me which kind of update? Thank you.
I pinged you just in case you were interested in a future answer from the Microsoft team on the possibility of using fp16 with mDeBERTa.
Hello there! I have tracked the different modules to find where the under/overflows are happening. The DisentangledSelfAttention module is the culprit; replacing it with the implementation in this repo fixed the issue (I haven't spent the time to find the specific operation causing the NaN).
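For anyone who wants to repeat that kind of module tracking, here is one way it can be sketched with PyTorch forward hooks (the toy model below is a stand-in of my own; the real culprit named above is DisentangledSelfAttention):

```python
import torch
import torch.nn as nn

# Attach forward hooks that record which submodules emit non-finite
# (NaN/inf) activations, to narrow down where an overflow first appears.
def attach_nan_hooks(model, bad):
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                bad.append(name)
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
bad = []
attach_nan_hooks(model, bad)
model(torch.tensor([[float("inf"), 0.0, 0.0, 0.0]]))
# `bad` now names every module whose output contained NaN/inf.
```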
Hey @rfbr I tried updating the DisentangledSelfAttention module in HF transformers with the one in this repo, but when fine-tuning on extractive QA (on SQuAD 2.0) with fp16 I was still getting NaN predictions. Do you have an example implementation in the transformers code I could look at?
Update: Actually it seems like I got it to work. It appears the key was calculating the scale like this (using the math library):
https://github.com/microsoft/DeBERTa/blob/4d7fe0bd4fb3c7d4f4005a7cafabde9800372098/DeBERTa/deberta/disentangled_attention.py#L85-L86
instead of what's implemented in transformers:
https://github.com/huggingface/transformers/blob/ef42c2c487260c2a0111fa9d17f2507d84ddedea/src/transformers/models/deberta_v2/modeling_deberta_v2.py#L724-L725

```python
scale = torch.sqrt(torch.tensor(query_layer.size(-1), dtype=torch.float) * scale_factor)
attention_scores = torch.bmm(query_layer, key_layer.transpose(-1, -2)) / scale.to(dtype=query_layer.dtype)
```

which uses only torch functionality.
Could this be because we aren't calling something like detach in the transformers code? Or maybe it has to do with the order of operations (e.g. performing the division before the multiplication, as is done in this repo).
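To see why the order of operations could matter under fp16, here is a toy numeric illustration (my own sketch with made-up dimensions, not the actual DeBERTa code): the largest finite fp16 value is about 65504, so dividing by the scale after the q·k product can overflow, while scaling the query first keeps intermediates in range.

```python
import math
import numpy as np

d = 512
q = np.full(d, 16.0, dtype=np.float16)
k = np.full(d, 16.0, dtype=np.float16)
scale = math.sqrt(d)  # plain Python float, as in the DeBERTa repo

# Scale AFTER the dot product: 512 * 16 * 16 = 131072 > 65504, so the
# fp16 result is already inf before the division can bring it back down.
after = np.float16(np.dot(q, k)) / np.float16(scale)

# Scale the query BEFORE the dot product: the result is ~5793, in range.
before = np.dot(q / np.float16(scale), k)
```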
Hey! Is there any update on this @BigBird01? I'm using the latest version of transformers (4.29.2) and I'm still facing the same issue when using fp16. When will you port the fix?
Thanks.
Hey @jplu I think I was able to port the changes into my forked branch of transformers here. If you'd just like to see the git diff so you can try the same, take a look here. I did this by comparing the implementation in this repo with the one in transformers.
Doing this I was able to get fp16 training working in transformers.
Hey @sjrl! Thanks a lot for sharing this. I confirm that with your code I am able to train with fp16. Did you open a PR on the main repo? If not, it would be nice to have this fix integrated.
@jplu Just opened the PR! I took some time to find the minimal changes needed to get the fp16 training to work. Hopefully that will speed up the review process.
Awesome this seems perfect! Thanks a lot!
This is honestly perfect, @sjrl. What a clever way to solve the problem! 🤩
I really like the DeBERTa-v3 models, and the monolingual models work very well for me. Weirdly enough, the multilingual model uploaded to the Hugging Face Hub does not seem to work. I have code for training multilingual models on XNLI, and the training normally works well (e.g. no issue with microsoft/Multilingual-MiniLM-L12-H384), but when I apply the exact same code to mDeBERTa, the model does not seem to learn anything. I don't get an error message, but the training results look like this:
I've double-checked by running the exact same code on multilingual-MiniLM, and the training works, which makes me think the problem is not in the code (e.g. wrongly formatted input data) but that something went wrong when uploading mDeBERTa to the Hugging Face Hub. Accuracy of exactly random 0.3333, zero training loss at epoch 2, and NaN validation loss maybe indicate that the data is running through the model but some parameters are not updating, or something like that?
My environment: Google Colab; transformers==4.12.5