Memory Leak in Deberta (v1) Base

avacaondata commented 3 years ago

Environment info

transformers version: 4.6.0.dev0
Platform: Linux-5.4.0-1047-aws-x86_64-with-glibc2.10
Python version: 3.8.8
PyTorch version (GPU?): 1.8.1 (True)
Tensorflow version (GPU?): not installed (NA)
Using GPU in script?: YES
Using distributed or parallel set-up in script?: NO

Who can help

@patrickvonplaten @LysandreJik @sgugger

Information

Model I am using (Bert, XLNet ...):

I am using DeBERTa-base. I first pre-trained it with MLM-WWM on >630M texts in Spanish, with a BPE tokenizer trained on the same corpus, which in total is 590M (I trained for more than one epoch). I am now trying to fine-tune this model, but I'm facing some issues.

First, DeBERTa is supposed to be much better than BERT and RoBERTa, but I'm seeing very poor performance compared to the other Spanish model, dccuchile/bert-base-spanish-cased (BETO from now on), which supposedly has a worse architecture and was trained only slightly longer than my model. I've tried many different hyperparameters, following the recommendations in the DeBERTa paper, without improving results. For reference, on a problem where BETO achieves 0.97, I reach 0.91 at best.

Second, as I'm training many models for a hyperparameter search (without using your hyperparameter search API), I see that GPU memory usage increases with each new DeBERTa model, which doesn't happen with BETO. I think this is a memory leak in the DeBERTa implementation, or at least in its token classification and sequence classification layers. I don't know whether this memory leak is related to the model's poor performance. Could you please take a look at it?

I hope the architecture itself is not wrongly coded, because otherwise we've spent thousands of dollars on training a Spanish model from scratch for nothing. Please let me know if I can give any further information that would help clear this up. I'm a little anxious because the results aren't as expected and because there are clear signs that the DeBERTa implementation has, at least, a memory management problem.

The problem arises when using:

[ ] the official example scripts: (give details below)
[X] my own modified scripts: (give details below): My script consists of a loop that trains different versions of my Spanish DeBERTa model on a dataset (each version is the same model with different hyperparameters).

The tasks I am working on is:

[ ] an official GLUE/SQUaD task: (give the name)
[X] my own task or dataset: (give details below): I've tried with PAWS-X, CoNLL-2002, Ehealth_kd, Muchocine. All these datasets were downloaded from the datasets library.

To reproduce

Steps to reproduce the behavior:

Use the deberta-base model and fine-tune it on a given dataset (it doesn't matter which one).
Create a hyperparameter dictionary and get the list of hyperparameters for each run with list(sklearn.model_selection.ParameterGrid(search_dic)).
Train the model with Trainer, using the hyperparameters from the above list in each run. As each model is trained, you will see memory usage increase, even after calling torch.cuda.empty_cache().
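The steps above can be sketched as a small harness that records a memory reading after every run. This is only an illustration, not the script from the report: `expand_grid`, `gpu_bytes_allocated` and `sweep` are hypothetical names, `expand_grid` is a plain-Python stand-in for scikit-learn's `ParameterGrid`, and `run_one` is assumed to be a user-supplied function that builds and trains a Trainer for one set of hyperparameters.

```python
import gc
from itertools import product


def expand_grid(search_dic):
    """Expand {name: [values]} into a list of dicts, one per combination."""
    keys = sorted(search_dic)
    combos = product(*(search_dic[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]


def gpu_bytes_allocated():
    """Bytes held by live CUDA tensors; 0 if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return 0
    if not torch.cuda.is_available():
        return 0
    gc.collect()                  # drop unreachable Python objects first
    torch.cuda.empty_cache()      # return unused cached blocks to the driver
    return torch.cuda.memory_allocated()


def sweep(search_dic, run_one, measure=gpu_bytes_allocated):
    """Call run_one(params) for each combination and record memory afterwards."""
    readings = []
    for params in expand_grid(search_dic):
        run_one(params)           # assumed to train one model and return
        readings.append((params, measure()))
    return readings
```

If the readings keep growing from one run to the next even though `run_one` returns without keeping the model or trainer, that would match the behavior described in this report.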

Expected behavior

Given the results reported in the DeBERTa paper, DeBERTa-base is expected to work better than BERT-base (the architecture of BETO) with less training, so I wouldn't expect much worse results than BETO after training for almost as long. Also, after each run with Trainer, after deleting the trainer with del trainer and releasing GPU memory with torch.cuda.empty_cache(), GPU memory usage should not increase from run to run. It does not increase with other model architectures, but it does with DeBERTa.
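One thing worth ruling out before blaming the model code: `del` only removes a name, and `torch.cuda.empty_cache()` can only return cached blocks that no live tensor occupies. If anything else still references the trainer or model, its memory stays allocated. The sketch below shows this with a weak reference; `FakeTrainer` and `released_after_del` are hypothetical names used purely for the demonstration and have nothing to do with the real Trainer class.

```python
import gc
import weakref


class FakeTrainer:
    """Stand-in for a real trainer object (hypothetical, demo only)."""

    def __init__(self, name):
        self.name = name


def released_after_del(keep_extra_reference):
    """Return True if the object is actually freed once its name is deleted."""
    trainer = FakeTrainer("run-0")
    probe = weakref.ref(trainer)      # does not keep the object alive
    # Simulate a forgotten reference, e.g. a results list or a callback.
    extra = [trainer] if keep_extra_reference else []
    del trainer                       # removes the name, not necessarily the object
    gc.collect()
    return probe() is None
```

Calling `released_after_del(False)` reports that the object is gone, while `released_after_del(True)` reports that it is still alive because of the extra reference. The same probe can be pointed at a real model or trainer inside the training loop.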

LysandreJik commented 3 years ago

Hello @alexvaca0, thank you for opening an issue! Is there a way for you to provide a Colab with a reproducer so that we may take a look at the memory issue?

Regarding the very bad performance, and your query that you hope the "architecture itself is not wrongly coded" - rest assured, the architecture was contributed by the author of the model. I believe DeBERTa has historically been hard to pretrain, as I've heard similar reports in the past. Pinging @BigBird01, the author of the model.

Pengcheng, do you have some tips regarding pretraining the DeBERTa model?

I believe the original repository also contains code for model pretraining: https://github.com/microsoft/DeBERTa Have you taken a look at the pretraining script in that repository?

BigBird01 commented 3 years ago

Yes. We already released our code for pre-training and fine-tuning (SiFT) in our public repo. Please take a look at it. When you say it's hard to pre-train, what do you refer to? Do you mean instability or accuracy of the model?

Thanks! Pengcheng


avacaondata commented 3 years ago

@BigBird01 @LysandreJik Hi, thanks to both of you for the quick response, I really appreciate your help :) I don't think I can find the time to prepare a reproducer right now. If you have a script that trains a model with several configurations in a loop, or that uses Optuna through the hyperparameter search API of Trainer (it also happens there), you can just replace the model string with microsoft/deberta-base. One of the example Colabs from Transformers would also work, as you'd only have to replace the model name: https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb

I'm glad to know that there are no mistakes in the implementation itself, and therefore the only issue to solve is this memory leak.

I've taken a look at the DeBERTa repository, but I can't find a pre-training script; where exactly can I find it? However, in order not to waste all the money already spent on training the model, I think it'd be more appropriate to continue using the Transformers code. I've followed all the pre-training hyperparameters stated in the paper for DeBERTa-base; these don't change in your pre-training script, do they? @BigBird01 Another issue is that there is no SpanWholeWordMaskCollator in Transformers, so we are training with Whole Word Masking. Do you think this will severely affect model performance? On the other hand, if you have code for collating batches with Span Whole Word Masking, do you think it would be possible to add it to the Transformers data_collator.py code and continue training with that new collator? Or might this make the model diverge?

Thank you again, and sorry about all the questions, I've many doubts regarding this subject.

Regards,

Alejandro

LysandreJik commented 3 years ago

@BigBird01 other people have had issues with pretraining, this issue comes to mind: https://github.com/huggingface/transformers/issues/11689

sgugger commented 3 years ago

@alexvaca0 The Transformers library is not intended to become a host for data collators specific to all possible tasks, so we probably won't add this SpanWholeWordMaskCollator. You can however copy it into any of your scripts and use it.

avacaondata commented 3 years ago

@sgugger I don't think that collator is so rare, in fact many models such as SpanBERT, ALBERT and DEBERTA use this pre-training setup...

avacaondata commented 3 years ago

Any updates regarding the memory leak? I'm still experiencing it...

LysandreJik commented 3 years ago

Hi @alexvaca0, I am trying to reproduce the memory leak you mention but have not managed to. Within a loop I create a new model, TrainingArguments and Trainer, start the training and look at the metrics.

I also tried only running trainer.train() within the loop, and the second iteration gets a slight increase in GPU memory usage but it stabilizes right after.

I've tried with the hyper-parameter search as well (using optuna) but have not managed to replicate the memory increase you mention.

If you have a script (even a large one as long as I can run it locally) or a notebook that reproduces the leak, I would happily take a look at it.

avacaondata commented 3 years ago

I could prepare one, although I cannot share my pre-trained DeBERTa model with you... But I think we could replicate it the following way: train the English deberta-base model for a few more steps, save that checkpoint, and then run a hyperparameter search with Optuna starting from that checkpoint rather than from the official deberta-base checkpoint.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

avacaondata commented 3 years ago

I'm still experiencing this issue. For example, if you initialize a Trainer in Google Colab with deberta-base and then replace it with a trainer for another model, the GPU memory used by DeBERTa is not released. That is, if the DeBERTa trainer used 16GB and I then create a new trainer with bert-base, for example, the memory from the old object is not freed. This brings me to the same conclusion as above: there must be a memory leak in the DeBERTa code, which leaves objects on the GPU that cannot be released. @patrickvonplaten @LysandreJik @sgugger
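A possible way to check which objects are still alive after replacing the trainer is to scan what the garbage collector tracks. The sketch below is an assumption about how one might investigate, not something used in this thread: `live_objects` and `is_cuda_tensor` are hypothetical helpers, and `is_cuda_tensor` simply returns False when torch is not installed.

```python
import gc


def live_objects(predicate):
    """Return the gc-tracked objects for which predicate(obj) is true."""
    gc.collect()
    found = []
    for obj in gc.get_objects():
        try:
            if predicate(obj):
                found.append(obj)
        except Exception:
            # Some objects raise on inspection (e.g. dead proxies); skip them.
            continue
    return found


def is_cuda_tensor(obj):
    """True for a torch tensor stored on a CUDA device; False without torch."""
    try:
        import torch
    except ImportError:
        return False
    return torch.is_tensor(obj) and obj.is_cuda
```

Comparing `len(live_objects(is_cuda_tensor))` before creating the first trainer and again after deleting it gives a rough count of tensors that stayed on the GPU. In a notebook, the interpreter itself can keep references (for example to the last cell output or to a stored traceback), so it helps to run the same check in a plain script as well.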

LysandreJik commented 3 years ago

Hello @alexvaca0! As mentioned above, I have tried to reproduce this but have failed to do so. Would you happen to have a script handy so that we may take a look? You do not need to share your model if you do not wish to; any DeBERTa model on the Hub should do.

Thank you.

LysandreJik commented 3 years ago

Could this be a fix to your issue? https://github.com/huggingface/transformers/pull/12718

avacaondata commented 3 years ago

Hi @LysandreJik , as soon as I can I'll try to re-install transformers from source and see if #12718 fixes my issue, although it seems to be related to cpu memory, not gpu memory; moreover, I didn't experience this with any other model but deberta-base, with BERT for example it worked smoothly. I'll also prepare a notebook for you to reproduce, as soon as the workload I have enables me to do so. Thanks! :)

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
