microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Zero Level 3 Offload SOMETIMES FAILS on 8 GPUs, ALWAYS WORKS on 4 GPUs #940

Closed aced125 closed 3 years ago

aced125 commented 3 years ago

Hi - I'm getting a new error while trying to train a model on an 8 x V100 box. I'm using PyTorch Lightning, but I don't think that should make much of a difference.

Sys config:

PyTorch 1.8, CUDA 10.2, Ubuntu 18.04, DeepSpeed 0.3.14, Triton 0.2.3, Apex master branch, PyTorch Lightning 1.3.0rc1

Error trace:

Epoch 0:   0%|                                                                                | 0/564 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 488, in fit
    self.dispatch()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in dispatch
    self.accelerator.start_training(self)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 95, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 142, in start_training
    self._results = trainer.run_stage()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in run_stage
    self.run_train()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 607, in run_train
    self.train_loop.run_training_epoch()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 422, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 575, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1414, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 301, in optimizer_step
    self.lightning_module, optimizer, opt_idx, lambda_closure, **kwargs
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/deepspeed_precision.py", line 47, in pre_optimizer_step
    lambda_closure()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 570, in train_step_and_backward_closure
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 673, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 709, in backward
    result.closure_loss, optimizer, opt_idx, should_accumulate, *args, **kwargs
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 284, in backward
    self.lightning_module, closure_loss, optimizer, optimizer_idx, should_accumulate, *args, **kwargs
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/deepspeed_precision.py", line 73, in backward
    deepspeed_engine.backward(closure_loss, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1020, in backward
    self.allreduce_gradients()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 940, in allreduce_gradients
    self.optimizer.overlapping_partition_gradients_reduce_epilogue()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1393, in overlapping_partition_gradients_reduce_epilogue
    self.independent_gradient_partition_epilogue()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1295, in independent_gradient_partition_epilogue
    self.partition_previous_reduced_grads()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1657, in partition_previous_reduced_grads
    param.partition_gradients(partition_buffers=self.temp_grad_gpu_buffer)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 460, in partition_gradients
    accumulate=accumulate)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 794, in _partition_gradients
    accumulate=accumulate)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 862, in _partition_gradient
    param.grad.data = dest_tensor_full_buffer.data
UnboundLocalError: local variable 'dest_tensor_full_buffer' referenced before assignment
aced125 commented 3 years ago

Upon further investigation, this error only happens with ZeRO stage 3. ZeRO stage 2 works just fine.

aced125 commented 3 years ago
tensor([1.], device='cuda:3', dtype=torch.float16, requires_grad=True)

This is what the offending tensor looks like when I print it from the failing code path.

This makes sense: the tensor has a single element (tensor.numel() == 1), so the partition size is 1. For any rank >= 1 the partition offset (partition_size * rank) already lands at or past the end of the tensor, and the bug is triggered!
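
To make the arithmetic concrete, here is a rough sketch of that reasoning (my own illustration, not the actual DeepSpeed partitioning code, which also pads and aligns partitions):

import math

def partition_bounds(numel, world_size, rank):
    # naive even partitioning: each rank owns [start, end) of the flattened parameter
    partition_size = math.ceil(numel / world_size)
    start = min(partition_size * rank, numel)
    end = min(start + partition_size, numel)
    return start, end

for rank in range(8):
    print(rank, partition_bounds(numel=1, world_size=8, rank=rank))
# rank 0 owns (0, 1); every other rank gets an empty slice, which matches the
# corner case that leaves dest_tensor_full_buffer unassigned in _partition_gradient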

aced125 commented 3 years ago

Even weirder:

WORKS with 4 GPUs (batch sizes 1, 2, 4, 8, 16 all work)

FAILS with 8 GPUs (batch sizes 1, 2, 4, 8, 16 ALL FAIL)

What could be going on here?

aced125 commented 3 years ago

SOLVED: when there is a parameter in the network with numel < num_gpus, the system FAILS.

E.g. if num_gpus = 8 but a parameter in the network has only 6 elements, the system fails as above.

aced125 commented 3 years ago

@jeffra Not sure if this is intended behaviour? If so, it would definitely be good to warn people.

The reason this is important is that regression problems often end in a linear layer whose bias has very few parameters, e.g.

self.linear = nn.Linear(256, 1, bias=True)

The bias in this layer has only 1 element, so the system will fail on anything more than 1 GPU.
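
A quick way to spot such parameters up front (a hypothetical helper for illustration, not a DeepSpeed utility):

import torch.nn as nn

def tiny_params(model: nn.Module, world_size: int):
    # parameters with fewer elements than there are ranks to partition across
    return [(name, p.numel()) for name, p in model.named_parameters()
            if p.numel() < world_size]

print(tiny_params(nn.Linear(256, 1, bias=True), world_size=8))
# -> [('bias', 1)]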

tjruwase commented 3 years ago

@aced125, thanks for reporting and investigating this corner case. This is not intended behavior; our approach is not to partition or offload tiny parameters to CPU, which should handle this case. Based on your new findings, can you please clarify under what conditions the error is triggered?

aced125 commented 3 years ago

Sorry - actually I think I was wrong... The error is still happening...

aced125 commented 3 years ago

More findings:

aced125 commented 3 years ago

Okay, I'm now finding that it sometimes works and sometimes doesn't. This is getting really weird.

I'll run it once with some settings and it works. Then I run it again and boom, I get this error.

It could be the dataloader. Let me turn shuffle off and drop the last batch.
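
For concreteness, that change is just the standard torch.utils.data arguments (the dataset and batch size below are placeholders):

from torch.utils.data import DataLoader

# shuffle=False and drop_last=True, to rule out ordering / ragged-last-batch effects
loader = DataLoader(dataset, batch_size=16, shuffle=False, drop_last=True)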

aced125 commented 3 years ago

No luck on the dataloader.

@tjruwase could it be because of low CPU RAM? And if so, how would I debug that?

tjruwase commented 3 years ago

Could you try disabling CPU offloading of parameters by setting cpu_offload_params to false?
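
For reference, a sketch of where that flag would sit in the config dict, assuming the 0.3.x-era flat schema under zero_optimization (exact key placement may differ between DeepSpeed versions):

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "cpu_offload_params": False,  # disable parameter offload, as suggested
    },
}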

aced125 commented 3 years ago

@tjruwase Just tried turning off CPU offloading.

Works on 4 GPUs. Fails on 8 GPUs with the same issue.

aced125 commented 3 years ago

Btw, stage 2 works on everything. It's stage 3 that's the issue.

aced125 commented 3 years ago

More info: it also fails on 5 GPUs.

aced125 commented 3 years ago

Another weird quirk on stage 2: sometimes it says it cannot allocate memory, sometimes it runs just fine. Dataloader shuffle is off.

tjruwase commented 3 years ago

Can you share logs of stage 2 failing to allocate memory?

tjruwase commented 3 years ago

@aced125, thanks for the hard work in creating a stable repro with ZeRO stage 3. Is the failure on 5 and 8 GPUs repeatable?

aced125 commented 3 years ago

@tjruwase @jeffra I have FINALLY spotted the error!

My network outputs a tensor for classification over N classes.

When N = 36, the whole thing works on 8 GPUs.

When N = 35, it FAILS on 8 GPUs with the above error, but WORKS on 4 GPUs!!!

import torch.nn.functional as F
import torch as th

labels = th.randint(low=0, high=36, size=(32,))  # 32 integer class labels in [0, 36)
predictions = model(**inputs)  # shape (32, 36); or use th.randn(32, 36) as a stand-in

loss = F.cross_entropy(predictions, labels)
model.backward(loss)  # backward via the DeepSpeed engine

Any idea why this is the case?

tjruwase commented 3 years ago

@aced125, are you still seeing issues?

aced125 commented 3 years ago

Yes - but I solved it in the following hacky way:

import torch.nn as nn

output_dim = 15

# Workaround: make the layer wider than needed (64 outputs here) and slice the
# result back down to the dimension actually required, so no parameter ends up tiny.
lin = nn.Linear(in_dim, 64)

y = lin(x)
y = y[:, :output_dim]

aced125 commented 3 years ago

It seems that when output_dim >= 36 things work; otherwise it fails.

tjruwase commented 3 years ago

So can you please provide steps to repro the failure, so we can continue the investigation?

SantoshGuptaML commented 3 years ago
tensor([1.], device='cuda:3', dtype=torch.float16, requires_grad=True)

This is what the offending tensor looks like when I print it from the failing code path.

This makes sense: the tensor has a single element (tensor.numel() == 1), so the partition size is 1. For any rank >= 1 the partition offset (partition_size * rank) already lands at or past the end of the tensor, and the bug is triggered!

I am having the same issue, trying to get the Bing SQuAD example to work with 4 GPUs. How were you able to print the exact tensor that was causing the issue? I'd like to do the same to figure out where the issue is happening.

tjruwase commented 3 years ago

@SantoshGuptaML, can you clarify the exact error you are seeing, since multiple issues were involved here?

To your question about printing actual tensor values, you need to use the GatheredParameters API, as follows:

for n, p in model.named_parameters():
    with deepspeed.zero.GatheredParameters(p):
        val = p.detach().to('cpu').data.float()
        # print0 and tag are assumed to be defined by the caller
        # (a rank-0-only print helper and a label string)
        print0("{} {}: {} {}".format(tag, n, val.shape, val))