huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ZeRO 3 error: expected the next 4 parameters in the parameter fetch queue to be ... but got () #23693

Closed dcaffo98 closed 1 year ago

dcaffo98 commented 1 year ago

System Info

Who can help?

@stas00 may be the most suited for this, since the issue is probably related to DeepSpeed.

Information

Tasks

Reproduction

Currently, I'm struggling to make a reproducible script, as the error happens suddenly during training with ZeRO stage 3 enabled and I'm using a custom dataset. The task is contrastive-loss pretraining. The backbone is GLPN's encoder, followed by a custom Attention Pooling module; the parameters causing the issue belong to this pooling module (see the error below). The DeepSpeed version is 0.9.1. The issue may be related to this one, although the stack trace is not identical. The error shows up only when resuming from a checkpoint (resume_from_checkpoint=/path/to/checkpoint). I'm attaching the log output (error.txt), the DeepSpeed ZeRO 3 configuration I'm using (config_adam_zero3.txt), and the custom model implementation ([modeling_custom_apr.txt](https://github.com/huggingface/transformers/files/11545331/modeling_custom_apr.txt)).
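
Roughly, the setup looks like the sketch below (simplified, not the exact code, which is in the attached modeling_custom_apr.txt; the checkpoint name, class names, and pooling details here are illustrative, only the q/k/v/c_proj parameter layout mirrors the error further down):

```python
import torch
import torch.nn as nn
from transformers import GLPNModel


class AttentionPool(nn.Module):
    """Single-query attention pooling over the encoder's final feature map."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # parameter names mirror those in the tracing error (attn_pool.{q,k,v,c}_proj)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.c_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.flatten(2).transpose(1, 2)                      # (B, C, H, W) -> (B, H*W, C)
        q = self.q_proj(x.mean(dim=1, keepdim=True))          # mean token as the pooling query
        k, v = self.k_proj(x), self.v_proj(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return self.c_proj(attn @ v).squeeze(1)               # (B, C) pooled embedding


class GLPNWithAttnPool(nn.Module):
    """Illustrative stand-in for the custom model: GLPN encoder + attention pooling."""

    def __init__(self, backbone: str = "vinvino02/glpn-kitti"):
        super().__init__()
        self.encoder = GLPNModel.from_pretrained(backbone)
        self.attn_pool = AttentionPool(self.encoder.config.hidden_sizes[-1])

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(pixel_values).last_hidden_state  # final-stage feature map
        return self.attn_pool(feats)                          # embedding fed to the contrastive loss
```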

This is the last part of the log where the error shows up

[2023-05-23 14:02:25,781] [INFO] [logging.py:96:log_dist] [Rank 0] step=14290, skipped=17, lr=[0.00014992267618019753], mom=[(0.9, 0.999)]
[2023-05-23 14:02:25,783] [INFO] [timer.py:199:stop] epoch=0/micro_step=2070/global_step=2070, RunningAvgSamplesPerSec=8.340844178398823, CurrSamplesPerSec=8.091999012978865, MemAllocated=0.4GB, MaxMemAllocated=19.03GB
{'loss': 1.0438, 'learning_rate': 0.00014992267618019753, 'epoch': 3.68}
[2023-05-23 14:02:36,757] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, but hysteresis is 2. Reducing hysteresis to 1
  5%|▍         | 14287/305600 [3:34:27<454:15:14,  5.61s/it]
  5%|▍         | 14288/305600 [3:34:33<467:44:45,  5.78s/it]
  5%|▍         | 14289/305600 [3:34:38<455:08:12,  5.62s/it]
  5%|▍         | 14290/305600 [3:34:43<443:40:08,  5.48s/it]
  5%|▍         | 14291/305600 [3:34:49<448:35:16,  5.54s/it]
  5%|▍         | 14292/305600 [3:34:54<442:30:06,  5.47s/it]
Traceback (most recent call last):
  File "/mnt/beegfs/scratch/dcaffagni/runs/clpt_gpu_2_lr_154_cos_10k_wu/maticad_side/train.py", line 96, in <module>
    train_out = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/transformers/trainer.py", line 2661, in training_step
    loss = self.deepspeed.backward(loss)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1796, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1923, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 62, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 169, in backward
    ctx.pre_backward_function(ctx.module)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 419, in _run_before_backward_function
    self.pre_sub_module_backward_function(sub_module)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 500, in pre_sub_module_backward_function
    param_coordinator.fetch_sub_module(sub_module)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
Traceback (most recent call last):
  File "/mnt/beegfs/scratch/dcaffagni/runs/clpt_gpu_2_lr_154_cos_10k_wu/maticad_side/train.py", line 96, in <module>
    train_out = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/transformers/trainer.py", line 2661, in training_step
    loss = self.deepspeed.backward(loss)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1796, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1923, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 62, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 169, in backward
    ctx.pre_backward_function(ctx.module)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 419, in _run_before_backward_function
    self.pre_sub_module_backward_function(sub_module)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 500, in pre_sub_module_backward_function
    param_coordinator.fetch_sub_module(sub_module)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 288, in fetch_sub_module
    raise RuntimeError(
RuntimeError: tracing error at step 999: 
module id: 921, training: True
expected the next 4 parameters in the parameter fetch queue to be ({'id': 'name=attn_pool.k_proj.bias id=915', 'status': 'AVAILABLE', 'numel': 512, 'ds_numel': 512, 'shape': (512,), 'ds_shape': (512,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {921}}, {'id': 'name=attn_pool.v_proj.bias id=919', 'status': 'AVAILABLE', 'numel': 512, 'ds_numel': 512, 'shape': (512,), 'ds_shape': (512,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {921}}, {'id': 'name=attn_pool.c_proj.bias id=921', 'status': 'AVAILABLE', 'numel': 512, 'ds_numel': 512, 'shape': (512,), 'ds_shape': (512,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {921}}, {'id': 'name=attn_pool.q_proj.bias id=917', 'status': 'AVAILABLE', 'numel': 512, 'ds_numel': 512, 'shape': (512,), 'ds_shape': (512,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {921}}) 
but got 
 ().

Expected behavior

After resuming from a checkpoint, training should proceed normally, as it does when training from scratch with the same setup.
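
For reference, a minimal sketch of how the run is launched and resumed (the dataset, the paths, and the GLPNWithAttnPool class are placeholders carried over from the sketch above, not the actual project code):

```python
from transformers import Trainer, TrainingArguments

model = GLPNWithAttnPool()   # illustrative model from the sketch above
train_dataset = ...          # custom contrastive dataset (not shown in this issue)

training_args = TrainingArguments(
    output_dir="clpt_run",
    fp16=True,
    deepspeed="config_adam_zero3.json",              # the attached ZeRO-3 config (uploaded as .txt)
    resume_from_checkpoint="clpt_run/checkpoint-1",  # placeholder path; set only when resuming
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)

# Training from scratch with the same setup runs fine:
#   trainer.train()
# Resuming is what triggers the tracing error shortly afterwards:
trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
```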

stas00 commented 1 year ago

Hi @dcaffo98, it'd be best to file this directly with DeepSpeed (https://github.com/microsoft/DeepSpeed/issues), since the issue is on the DeepSpeed side.

In general, such issues relate to code that changes the model after it was initialized, but there are many complex, nuanced situations, so it's best to talk to the DS developers directly.
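
One example of the kind of post-init change this refers to (illustrative only, not code from this issue), together with the usual safe pattern under ZeRO-3 using deepspeed.zero.GatheredParameters; `model` stands for the already-initialized module:

```python
import deepspeed
import torch
import torch.distributed as dist

# Anti-pattern under ZeRO-3: mutating a parameter after partitioning has been set up,
# e.g. re-initializing a weight in place, can leave the parameter-fetch tracer
# inconsistent with what actually runs.
# model.attn_pool.q_proj.weight.data.normal_()

# Safer: gather the partitioned parameter, modify it on one rank, and let DeepSpeed
# re-partition it when the context exits.
with deepspeed.zero.GatheredParameters(model.attn_pool.q_proj.weight, modifier_rank=0):
    if dist.get_rank() == 0:
        torch.nn.init.normal_(model.attn_pool.q_proj.weight)
```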

dcaffo98 commented 1 year ago

I've filed the issue with the DS team as well. It may be worth noting that the error happens right after the first detected OVERFLOW in this run. However, multiple overflows occurred during the previous 24 hours of training (before resuming from the checkpoint).
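
For context, the OVERFLOW/hysteresis messages in the log come from DeepSpeed's dynamic fp16 loss scaling, which is controlled by the fp16 section of the DeepSpeed config. A sketch of that section with the usual defaults (not necessarily the values in config_adam_zero3.txt):

```python
# Standard DeepSpeed fp16 keys, shown here with typical default values.
ds_fp16_section = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 -> dynamic loss scaling
        "initial_scale_power": 16,  # initial loss scale = 2**16
        "loss_scale_window": 1000,  # overflow-free steps before the scale is raised again
        "hysteresis": 2,            # how many overflows are absorbed before the scale is lowered
        "min_loss_scale": 1,
    }
}
```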

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.