microsoft / DeBERTa

The implementation of DeBERTa
MIT License

out of memory #109

Open Amazing-J opened 1 year ago

Amazing-J commented 1 year ago

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 550. GiB for an array with shape (28235788,) and data type |S20921

How can this error be solved? It occurs while the pretraining corpus is being read. My machine has 360 GB of RAM.

stefan-it commented 1 year ago

Hi @Amazing-J ,

this may not solve the problem, but in my preliminary experiments with v2 pretraining, a 35 GB corpus used around 512 GB of CPU RAM, which is huge, I know.

A workaround could be to implement sharded reading of the pretraining data, as is done e.g. in the TensorFlow implementations of BERT, ALBERT or ELECTRA, which read TFRecords and are really memory efficient.
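
Purely as an illustration of that idea, a minimal sketch of shard-based reading could look like the following; the shard naming scheme and file layout are assumptions, not something from this repo:

```python
# Hypothetical sketch of shard-based corpus reading (file layout is assumed).
# Instead of materializing the whole corpus in memory, iterate over pre-split
# shard files and yield one line at a time.
import glob

def iter_corpus(shard_glob="pretrain_data/shard_*.txt"):
    for shard_path in sorted(glob.glob(shard_glob)):
        with open(shard_path, encoding="utf-8") as shard:
            for line in shard:
                line = line.strip()
                if line:
                    yield line

# Usage: consume lazily, e.g. `for text in iter_corpus(): tokenize(text)`,
# so peak RAM stays bounded by a single shard rather than the whole corpus.
```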

Amazing-J commented 1 year ago

@stefan-it But the paper reports using hundreds of GB of data. How did they do it?

pvcastro commented 1 year ago

@stefan-it were you able to replicate this pretraining with a larger dataset? It runs OK with the wiki-103 sample, but with a larger dataset (35 GB before preprocessing, 54 GB after), it's still loading on a DGX-A100 after 2 days and is already using 1.5 TB of RAM, not 512 GB as you said :cry: I'll probably kill the process soon.

stefan-it commented 1 year ago

Ah, I used one GPU for my experiments, so in a multi-GPU setting it could be even more! After the pretraining code was released a few weeks ago, I did one experiment training an xsmall v3 model for 500k steps (I only have one GPU available), but the performance was not really great and the CPU RAM usage was very high.

But maybe you can create an extra swap partition for that?

pvcastro commented 1 year ago

The training started right when I was about to kill it :sweat_smile: Data loading took 40 hours. @stefan-it do you know whether running this same code in a distributed setting would make this same dataset take the same amount of RAM on every machine in the training cluster? I guess so, right?

StephennFernandes commented 1 year ago

@pvcastro were you able to train DeBERTa-v3-large on a single DGX-A100?

I have 4 A6000s, but I'm unable to get past the OOM error. The implementation doesn't cover FSDP or DeepSpeed ZeRO integrations.

pvcastro commented 1 year ago

Hi @StephennFernandes, I did, but it was extremely slow; the biggest batch size I was able to use was 96 on a single A100 GPU. Right now I'm struggling with the fact that the entire dataset is loaded into RAM, and I don't have access to any cluster that can handle all that.

StephennFernandes commented 1 year ago

Yesterday I ran it on the 500 MB dummy Wikipedia dataset that's provided by default. It completely hogged all the VRAM, and yeah, same here, it was extremely slow. The training estimate said it would take 500 hours.

Also, please check this issue: after training has been running for quite some time, one GPU just hogs all the RAM, and there is CPU load but bloated VRAM... Did you notice this issue?

StephennFernandes commented 1 year ago

Also @pvcastro, what version of torch are you using? I upgraded to PyTorch 2.0, and the torch._six module caused dependency issues, so I had to refactor the code in dataloader.py to get it running.
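
For readers hitting the same import error: the usual workaround looks roughly like the snippet below. Whether dataloader.py imports exactly these names is an assumption, not something confirmed in this thread.

```python
# torch._six was removed in recent PyTorch releases. A common refactor is to
# replace, e.g.,
#     from torch._six import string_classes
# with the plain built-in equivalent and leave the rest of the code unchanged:
string_classes = (str, bytes)

# Existing checks such as isinstance(x, string_classes) keep working as before.
```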

pvcastro commented 1 year ago

@StephennFernandes I'm not sure if I used 2.0 or 1.13.1, but I did have to make the _six adjustment you mentioned as well. I also refactored the prepare_data script so it doesn't load the whole dataset into memory. And no, I didn't notice the issue you mentioned. I just left it training for a couple of days and then killed the process. None of the 16 checkpoints it saved during those 2 days was good enough for evaluation; they scored 0 on benchmarks.
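
A simplified sketch of what such a streaming preprocessing pass could look like (the function and file names here are hypothetical, not the exact refactor mentioned above):

```python
# Rough sketch of streaming preprocessing: tokenize one line at a time and
# write each encoded example out immediately instead of collecting the whole
# corpus in a list first.
def prepare_streaming(tokenizer, input_path, output_path, max_seq_length=512):
    with open(input_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            # Reserve two positions for special tokens such as [CLS]/[SEP].
            tokens = tokenizer.tokenize(line)[: max_seq_length - 2]
            dst.write(" ".join(tokens) + "\n")
```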

chengming1108 commented 11 months ago

Hey, I have the same problem. Have you managed to finetune V3? I use a small Chinese dataset, but it cannot run.

pvcastro commented 11 months ago

Yes @chengming1108, I was able to get some results by improving the preprocessing and adjusting the hyperparameters, using a large dataset of around 80 GB.

chengming1108 commented 11 months ago

Nice work~ I have about 3M of Chinese data. With torch 1.6 the code cannot use the GPU, and when loading the data into memory it still uses 60 GB before the process gets killed. I have no idea what to do.

pvcastro commented 11 months ago

@chengming1108 maybe it's because the original code loads the entire dataset into memory for training. I switched to Hugging Face's tokenizer to avoid this, and that way the RAM consumption is much more manageable.
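
As a rough illustration of that kind of setup (not the actual changes from this thread), a streaming pipeline with Hugging Face `datasets` and a fast tokenizer could look like this; the checkpoint name and data paths are placeholders:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed checkpoint and corpus location; adjust to your own setup.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
corpus = load_dataset("text", data_files={"train": "corpus/*.txt"}, streaming=True)

def encode(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# With streaming=True the files are read lazily, so RAM usage stays roughly flat
# instead of growing with the corpus size.
tokenized = corpus["train"].map(encode, batched=True)
```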

StephennFernandes commented 10 months ago

@pvcastro hey, if you don't mind, could you please share the code showing how you use Hugging Face datasets and tokenizers for memory efficiency? Thanks.

pvcastro commented 10 months ago

DeBERTa_changes.zip Here they are, @StephennFernandes. I'm still struggling to get consistent results with DeBERTa: I keep getting different results with the same parameters and the same seed, and I'm also struggling with a tokenization mismatch between DeBERTa's tokenizer and Hugging Face's tokenizers.
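
For anyone debugging the same mismatch, a quick side-by-side check along these lines can show where the two tokenizers diverge. The loading calls follow the README-style usage of each project, and the pretrained IDs are assumptions:

```python
# Hedged sanity check: tokenize the same text with the repo's tokenizer and
# with Hugging Face's, then compare the outputs. IDs below are assumptions.
from DeBERTa import deberta
from transformers import AutoTokenizer

text = "An example sentence to compare tokenizations."

vocab_path, vocab_type = deberta.load_vocab(pretrained_id="deberta-v3-base")
repo_tokenizer = deberta.tokenizers[vocab_type](vocab_path)
hf_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

print(repo_tokenizer.tokenize(text))
print(hf_tokenizer.tokenize(text))
```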

StephennFernandes commented 10 months ago

@pvcastro hey man, same here: the discriminator values don't converge even after training for days. I haven't checked the evaluations yet.

pvcastro commented 10 months ago

The generator and discriminator do actually converge during pretraining, but the weird thing is that, regardless of how good the perplexity and evaluation loss are, some models are really good on downstream tasks while others are terrible, getting worse with every new checkpoint.