huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Distillation throws CUDA out of memory even with available GPU memory #2954

Closed sultanovazamat closed 4 years ago

sultanovazamat commented 4 years ago

❓ Questions & Help

Details

Hi! I am trying to run distillation of XLM-RoBERTa into ALBERT (and also into a smaller XLM-RoBERTa) on 4 GPUs (RTX 2080 Ti), after slightly modifying the distillation script so that training goes through small chunks of the dataset (preprocessing the whole dataset at once is difficult). The problem is that training throws CUDA OOM even though GPU memory consumption is at most 70%.

I've found the closed issue #1179 and tried installing torch from source to avoid some bugs, as suggested there, but the OOM still occurs, just a little later.

I've also tried several other things, all unsuccessful:

1. Reducing the batch size and max length doesn't help; it only prolongs training, and at some point distillation crashes again.
2. Running distillation chunk by chunk: train on one chunk -> save a checkpoint -> relaunch distillation from that checkpoint (sketched below).
3. Running with torch/apex distributed training.
4. Running with --fp16 / --fp32.
5. Running with/without amp optimization.
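For reference, the chunk-by-chunk workflow in (2) looks roughly like the sketch below: each chunk is trained in a fresh Python/CUDA session by relaunching the script as a subprocess. The flag names (`--data_file`, `--student_pretrained_weights`) and paths are placeholders, not confirmed arguments of the distillation script.

```python
# Hypothetical outer loop: relaunch the distillation script once per data chunk
# so that every chunk runs in a fresh process (and therefore a fresh CUDA context).
import glob
import subprocess

chunks = sorted(glob.glob("data/chunks/*.pkl"))   # preprocessed dataset chunks
checkpoint = None

for chunk in chunks:
    cmd = ["python", "train.py", "--data_file", chunk]            # placeholder flag names
    if checkpoint is not None:
        cmd += ["--student_pretrained_weights", checkpoint]        # resume from last checkpoint
    subprocess.run(cmd, check=True)
    checkpoint = "serialization_dir/last_checkpoint.pth"           # placeholder output path
```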

Is it possible that the problem is related to the dataset? (Training on different chunks throws OOM at different moments; some chunks are processed fully without any errors.)

I'd appreciate any help; I have no more guesses on how to solve this problem. Thanks!

BramVanroy commented 4 years ago

An error trace would be useful.

sultanovazamat commented 4 years ago

An error trace would be useful.

This is an error trace:

```
    F.softmax(t_logits_slct / self.temperature, dim=-1),
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 366, in forward
    return F.kl_div(input, target, reduction=self.reduction)
  File "/opt/conda/lib/python3.7/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1987, in kl_div
    reduced = torch.kl_div(input, target, reduction_enum)
RuntimeError: CUDA out of memory. Tried to allocate 818.00 MiB (GPU 0; 10.76 GiB total capacity; 8.61 GiB already allocated; 787.44 MiB free; 9.19 GiB reserved in total by PyTorch)
```
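For context, the failing line sits in the distillation loss; below is a minimal sketch of roughly what it computes (the temperature value and reduction mode are assumptions on my part). With XLM-R's ~250k-token vocabulary, the temperature-scaled softmax over the teacher logits materializes another (selected_tokens, vocab_size) tensor, which is where allocations of this size come from.

```python
# Rough sketch of the KL-divergence distillation loss the trace points at.
import torch.nn.functional as F
from torch import nn

temperature = 2.0                                   # illustrative value
ce_loss_fct = nn.KLDivLoss(reduction="batchmean")

def distillation_loss(s_logits_slct, t_logits_slct):
    # s_logits_slct / t_logits_slct: (num_selected_tokens, vocab_size)
    loss_ce = ce_loss_fct(
        F.log_softmax(s_logits_slct / temperature, dim=-1),   # student distribution
        F.softmax(t_logits_slct / temperature, dim=-1),       # teacher distribution (new allocation)
    ) * (temperature ** 2)
    return loss_ce
```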

BramVanroy commented 4 years ago

787.44 MiB free

So your GPU doesn't have enough memory available at that point. (Even if nvidia-smi says it is only using 70%.)

There are known issues with apex where it doesn't work well when you reload checkpoints and continue training in the same Python session. Does the same issue occur when you use torch DDP (not apex), no FP16, no amp?
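One way to see this from inside the process is to ask PyTorch's caching allocator directly; a small sketch, assuming PyTorch 1.4+ (earlier versions call `memory_reserved()` `memory_cached()`):

```python
# Compare PyTorch's own accounting with what nvidia-smi shows. nvidia-smi reports
# everything held by the process (caching allocator + CUDA context), so the GPU can
# appear ~70% used while the next large allocation still fails.
import torch

print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")  # live tensors
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")   # live + cached blocks
print(torch.cuda.memory_summary(abbreviated=True))                      # per-pool breakdown
```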

sultanovazamat commented 4 years ago

So your GPU doesn't have enough memory available at that point. (Even if nvidia-smi says it is only using 70%.)

Your point is right, but the strange thing is that this error can occur seemingly at random, even though GPU consumption stays below 70% for 99% of the training time. (This happens even with a tiny batch size.)

The same error occurs with DDP, no FP16, no amp. Moreover, I've also tried running the distillation on a single GPU without distribution, and the result is the same.

sultanovazamat commented 4 years ago

There are known issues with apex that it doesn't work well when you reload checkpoints and continue training in the same Python session.

BTW, I didn't reload the checkpoint in the same Python session. The distillation script was relaunched, loading the last checkpoint, as soon as a new checkpoint was made, so each session is new.

VictorSanh commented 4 years ago

Hello @AzamatSultonov, as far as I know, the memory leak mentioned in #1179 was fixed and released in PyTorch a couple of updates ago. I haven't encountered similar problems recently.

Can I ask what your batch size is? Have you tried a batch size of 1 (and slowly increasing it)? 11GB is not a lot to fit two models (and train one of them).
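A quick way to follow this suggestion is to probe batch sizes from 1 upward and stop at the first OOM; a rough sketch, where `run_step` is a hypothetical helper that builds one batch of the given size and runs a single forward/backward pass:

```python
# Double the batch size until a CUDA OOM is hit; returns the last size that worked.
import torch

def find_max_batch_size(run_step, start=1, limit=64):
    bs = start
    last_ok = None
    while bs <= limit:
        try:
            run_step(batch_size=bs)   # hypothetical: one full training step at this size
            last_ok = bs
            bs *= 2
        except RuntimeError as e:
            if "out of memory" in str(e):
                torch.cuda.empty_cache()   # release cached blocks before giving up
                break
            raise
    return last_ok
```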

sultanovazamat commented 4 years ago

Hello @VictorSanh, the minimum batch size I've tried is 3 (1 takes too much time), but the OOM was thrown again (with GPU memory still available).

The #1179 fix helped prolong training with a bigger batch size, but didn't solve the problem completely.

BTW, I turned off distributed training and launched the distillation on a single GPU with batch size 5 (periodically emptying the CUDA cache), and training has been running for almost 48 hours without crashes. This is still slow, but at least there's no OOM and the losses are going down. I'll let you know as soon as the training finishes.
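In case it's useful to others, the single-GPU workaround looks roughly like this (the interval and the `training_step` helper are placeholders for what the actual script does):

```python
# Periodically release cached blocks back to the driver. empty_cache() does not free
# live tensors; it only returns unused cached memory, which can reduce
# fragmentation-related OOMs at the cost of some speed.
import torch

EMPTY_CACHE_EVERY = 100                      # illustrative interval

for step, batch in enumerate(dataloader):    # dataloader as defined in the training script
    loss = training_step(batch)              # hypothetical helper: forward + backward + optimizer
    if step % EMPTY_CACHE_EVERY == 0:
        torch.cuda.empty_cache()
```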

VictorSanh commented 4 years ago

Are your batches of constant total size? i.e. do you always need the exact same amount of GPU memory for your intermediate computations? The reason I suggested starting with a batch size of 1 is to detect this. You can always use gradient accumulation to simulate a bigger batch size. Tracking memory usage in TensorBoard can also help.
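Both suggestions can be sketched on top of a plain PyTorch loop; `compute_distillation_loss`, `dataloader`, and `optimizer` below stand in for whatever the actual script uses, and the accumulation factor is illustrative:

```python
# Gradient accumulation to simulate a larger batch, plus per-step memory logging
# to TensorBoard so spikes can be matched to specific steps.
import torch
from torch.utils.tensorboard import SummaryWriter

accum_steps = 8                                        # effective batch = per-step batch * 8
writer = SummaryWriter("runs/distillation-memory")

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = compute_distillation_loss(batch)            # hypothetical helper
    (loss / accum_steps).backward()                    # scale so accumulated grads average out
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
    writer.add_scalar("memory/allocated_GiB", torch.cuda.memory_allocated() / 1024**3, step)
    writer.add_scalar("memory/reserved_GiB", torch.cuda.memory_reserved() / 1024**3, step)
```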

sultanovazamat commented 4 years ago

Are your batches of constant total size? i.e. do you always need the exact same amount of GPU memory for your intermediate computations?

Yes, they are. Also, in the latest version of the script, I changed the padding to the max length over the whole dataset instead of the max length within the current batch, so torch can reuse already-allocated tensors instead of reallocating memory.
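For reference, with a recent tokenizer API that kind of fixed-length padding looks roughly like this (the model name and `MAX_LEN` are illustrative):

```python
# Pad every example to one global max length so all batches have the same shape
# and the caching allocator can reuse its buffers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
MAX_LEN = 128  # global max length chosen for the whole dataset

def encode(texts):
    return tokenizer(
        texts,
        padding="max_length",   # pad to MAX_LEN, not to the longest item in the batch
        truncation=True,
        max_length=MAX_LEN,
        return_tensors="pt",
    )
```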

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

MasterHM-ml commented 3 years ago

I am facing this issue again; GPU usage is around 60% according to nvidia-smi. Tuning the batch size shouldn't matter in this situation, but I changed it anyway and the problem wasn't solved.

I'm trying to fine-tune XLM-RoBERTa for Urdu classification. transformers: 4.9.1, torch: 1.9.0+cu102