microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Problem with optimizer update step #1806

Open · benjamin-reichman opened this issue 2 years ago

benjamin-reichman commented 2 years ago

I keep hitting this error in the optimizer step and I am not sure what is causing it. The traceback is below:

Traceback (most recent call last):
  File "02232022_kat_repr_train.py", line 89, in <module>
    first_exp.run_experiment()
  File "experiments.py", line 28, in run_experiment
    self.trainer.train(self.hyperparameters["epochs"], self.hyperparameters["eval_period"])
  File "trainer.py", line 90, in train
    self.optimizer.step()
  File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1633, in step
    self.check_overflow()
  File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1913, in check_overflow
    self._check_overflow(partition_gradients)
  File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1818, in _check_overflow
    self.overflow = self.has_overflow(partition_gradients)
  File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1837, in has_overflow
    overflow = self.local_overflow if self.cpu_offload else self.has_overflow_partitioned_grads_serial(
  File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1830, in has_overflow_partitioned_grads_serial
    for j, grad in enumerate(self.averaged_gradients[i]):
KeyError: 0

My training loop is nothing fancy, pretty standard:

outputs = self.model(**batch)
loss = self.loss_function(outputs)
loss.backward()
self.optimizer.backward(loss)
self.model.zero_grad()

I initialize DeepSpeed like this:

self.model, self.optimizer, _, self.lr_scheduler = ds.initialize(
    model=self.model,
    config_params=self.deepspeed_config,
    optimizer=self.optimizer,
    lr_scheduler=self.lr_scheduler,
)
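For reference, my understanding of the basic training loop from the DeepSpeed getting-started guide is roughly the sketch below. This is not my actual code: the model, data, and config are toy stand-ins just to keep it self-contained, and it would need to be launched with the deepspeed launcher.

import torch
import deepspeed

# Toy stand-ins so the sketch is self-contained; not my real model/data/config
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loss_function = torch.nn.CrossEntropyLoss()
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
}

# deepspeed.initialize wraps the model in an engine; the engine owns backward() and step()
model_engine, optimizer, dataloader, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    training_data=dataset,
    config_params=ds_config,
)

for inputs, labels in dataloader:
    inputs = inputs.to(model_engine.device).half()   # fp16 is enabled in the config above
    labels = labels.to(model_engine.device)
    loss = loss_function(model_engine(inputs), labels)
    model_engine.backward(loss)   # engine-managed backward (reduces/partitions gradients for ZeRO)
    model_engine.step()           # engine-managed optimizer step (plus LR scheduler stepping when one is attached) and gradient zeroing

I may be misreading the docs, but the main difference from my loop above seems to be that both the backward pass and the step go through the engine object rather than loss.backward() and self.optimizer.step() directly.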

I am using the example configuration from here: https://github.com/microsoft/DeepSpeedExamples/blob/master/Megatron-LM/scripts/ds_zero2_config.json
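In case the exact settings matter: that file is a ZeRO stage-2 config, and the dict I end up passing as config_params is shaped roughly like the sketch below (the numbers are placeholders written from memory, not the exact values in the linked file):

ds_zero2_config = {
    "train_batch_size": 32,          # placeholder; the real value is the effective global batch size
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 2,                  # ZeRO stage 2: partitions optimizer states and gradients
        "contiguous_gradients": True,
        "overlap_comm": True,
        "reduce_bucket_size": 50000000,
        "allgather_bucket_size": 500000000,
    },
    "fp16": {
        "enabled": True,             # the overflow check in the traceback above is part of fp16 dynamic loss scaling
        "loss_scale": 0,             # 0 means dynamic loss scaling
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "wall_clock_breakdown": False,
}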

Does anyone know why I am getting this error?

Thank you!

tjruwase commented 2 years ago

@benjamin-reichman, thanks for reporting this error. That Megatron-LM code is quite old. Is it possible for you to use this version? This one is more actively used. Thanks!

ZyHuang1 commented 1 year ago

@benjamin-reichman, did you ever solve this? I am running into the same error and would like to know how you fixed it.

HoBeedzc commented 1 year ago

@benjamin-reichman, did you ever solve this? I am running into the same error and would like to know how you fixed it.

chenfengshijie commented 12 months ago

I am running into the same problem.