microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] zero stage > 0 does not support the case when different iterations see different learnable parameters #1355

Open amsword opened 3 years ago

amsword commented 3 years ago

Describe the bug: The model is a multi-task learning model. Different iterations compute different losses, so the set of learnable parameters that receive gradients is not exactly the same across iterations. In this case, we see the following error with stage = 1.

To Reproduce: Steps to reproduce the behavior:

I made a toy code snippet:

    import os

    import torch
    import torch.nn as nn


    def test_deepspeed_buf():
        class Model(nn.Module):
            def __init__(self):
                super().__init__()
                self.l1 = nn.Linear(128, 1)
                self.l2 = nn.Linear(128, 1)
                self.iter = 0

            def forward(self, x):
                # Alternate between the two heads: in any given iteration
                # only one of the two Linear layers contributes to the loss.
                self.iter += 1
                if (self.iter % 2) == 0:
                    y = self.l1(x)
                    return (y * y).sum()
                else:
                    y = self.l2(x)
                    return (y * y).sum()

        model = Model()
        optimizer = torch.optim.Adam(model.parameters(), 0.00001)
        config = {
            'fp16': {
                'enabled': True,
            },
            'zero_optimization': {
                'stage': 1
            },
            'train_batch_size': 4
        }
        os.environ['RANK'] = '0'
        os.environ['LOCAL_RANK'] = '0'
        os.environ['WORLD_SIZE'] = '1'
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '12345'

        import deepspeed
        deepspeed.init_distributed(distributed_port=12345)
        model_engine, _, _, _ = deepspeed.initialize(
            config_params=config,
            model=model,
            optimizer=optimizer,
        )
        for i in range(10):
            x = torch.zeros((4, 128)).cuda().half()
            y = model_engine(x)
            model_engine.backward(y)
            model_engine.step()


    if __name__ == '__main__':
        test_deepspeed_buf()

Run this code with, e.g., python script.py.
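
As an aside, the failing condition is easy to see with plain PyTorch and no DeepSpeed at all: because only one of the two Linear layers participates in a given iteration's loss, the other layer's parameters never receive a gradient and keep grad = None, which is exactly what the assertion below complains about. The snippet is only an illustrative sketch of that condition, not part of the reproduction script.

    # Plain-PyTorch sketch: the unused branch keeps .grad == None, which is
    # the condition the ZeRO stage-1 reduce step asserts on.
    import torch
    import torch.nn as nn

    l1, l2 = nn.Linear(128, 1), nn.Linear(128, 1)
    x = torch.zeros(4, 128)
    loss = (l1(x) ** 2).sum()      # this "iteration" only touches l1
    loss.backward()
    print(l1.weight.grad is None)  # False: l1 received a gradient
    print(l2.weight.grad is None)  # True: l2's grad is None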


~/code/quickdetection/src/DeepSpeed/deepspeed/runtime/engine.py in allreduce_gradients(self, bucket_size)
   1154             if self.zero_optimization_stage() == ZERO_OPTIMIZATION_OPTIMIZER_STATES:
   1155                 self.optimizer.reduce_gradients(
-> 1156                     pipeline_parallel=self.pipeline_parallelism)
   1157             else:
   1158                 self.buffered_allreduce_fallback(elements_per_buffer=bucket_size)

~/code/quickdetection/src/DeepSpeed/deepspeed/runtime/zero/stage2.py in reduce_gradients(self, pipeline_parallel)
    498             for i, group in enumerate(self.fp16_groups):
    499                 for param in group:
--> 500                     self.reduce_ready_partitions_and_remove_grads(param, i)
    501
    502         # reduce any pending grads in either hook/non-hook case

~/code/quickdetection/src/DeepSpeed/deepspeed/runtime/zero/stage2.py in reduce_ready_partitions_and_remove_grads(self, param, i)
   1115     def reduce_ready_partitions_and_remove_grads(self, param, i):
   1116         if self.partition_gradients or self.is_gradient_accumulation_boundary:
-> 1117             self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
   1118
   1119     def zero_reduced_gradients(self, partition_id, i):

~/code/quickdetection/src/DeepSpeed/deepspeed/runtime/zero/stage2.py in reduce_independent_p_g_buckets_and_remove_grads(self, param, i)
    741         self.elements_in_ipg_bucket += param.numel()
    742
--> 743         assert param.grad is not None, f"rank {dist.get_rank()} - Invalid to reduce Param {param_id} with None gradient"
    744
    745         self.grads_in_ipg_bucket.append(param.grad)

AssertionError: rank 0 - Invalid to reduce Param 0 with None gradient
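
For illustration only, one common way to avoid this failure mode is to attach a zero-valued term over all parameters to the loss, so that every parameter receives a (zero) gradient in every iteration. The helper below is a hedged sketch of that pattern, not a fix from DeepSpeed itself or from this thread, and the name loss_with_all_params is made up for the example.

    import torch
    import torch.nn as nn

    def loss_with_all_params(model: nn.Module, active_loss: torch.Tensor) -> torch.Tensor:
        # 0.0 * sum(...) adds nothing numerically, but it connects every
        # parameter to the autograd graph, so each one gets a (zero) gradient
        # instead of grad == None at reduce time.
        dummy = sum(p.sum() for p in model.parameters()) * 0.0
        return active_loss + dummy
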
tjruwase commented 3 years ago

@amsword, can you please try the master branch against this script? I think this might already be fixed.

tjruwase commented 2 years ago

@amsword, are you still seeing this problem or is it okay to close? Thanks.