Describe the bug
The model is a multi-task learning model. Different iterations optimize different losses, so the set of learnable parameters that receive gradients is not exactly the same across iterations. In this setup we have seen the following error with ZeRO stage = 1:
~/code/quickdetection/src/DeepSpeed/deepspeed/runtime/engine.py in allreduce_gradients(self, bucket_size)
1154 if self.zero_optimization_stage() == ZERO_OPTIMIZATION_OPTIMIZER_STATES:
1155 self.optimizer.reduce_gradients(
-> 1156 pipeline_parallel=self.pipeline_parallelism)
1157 else:
1158 self.buffered_allreduce_fallback(elements_per_buffer=bucket_size)
~/code/quickdetection/src/DeepSpeed/deepspeed/runtime/zero/stage2.py in reduce_gradients(self, pipeline_parallel)
498 for i, group in enumerate(self.fp16_groups):
499 for param in group:
--> 500 self.reduce_ready_partitions_and_remove_grads(param, i)
501
502 # reduce any pending grads in either hook/non-hook case
~/code/quickdetection/src/DeepSpeed/deepspeed/runtime/zero/stage2.py in reduce_ready_partitions_and_remove_grads(self, param, i)
1115 def reduce_ready_partitions_and_remove_grads(self, param, i):
1116 if self.partition_gradients or self.is_gradient_accumulation_boundary:
-> 1117 self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
1118
1119 def zero_reduced_gradients(self, partition_id, i):
~/code/quickdetection/src/DeepSpeed/deepspeed/runtime/zero/stage2.py in reduce_independent_p_g_buckets_and_remove_grads(self, param, i)
741 self.elements_in_ipg_bucket += param.numel()
742
--> 743 assert param.grad is not None, f"rank {dist.get_rank()} - Invalid to reduce Param {param_id} with None gradient"
744
745 self.grads_in_ipg_bucket.append(param.grad)
AssertionError: rank 0 - Invalid to reduce Param 0 with None gradient
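The assertion fires because a parameter that does not participate in the current iteration's loss never receives a .grad tensor. This is easy to see in plain PyTorch; a minimal illustration (hypothetical module names, not the actual model):

```python
import torch

trunk = torch.nn.Linear(4, 4)
head_a = torch.nn.Linear(4, 1)
head_b = torch.nn.Linear(4, 1)  # not used in this iteration's loss

x = torch.randn(2, 4)
loss = head_a(trunk(x)).mean()  # only trunk and head_a are in the graph
loss.backward()

print(head_a.weight.grad is None)  # False: received a gradient
print(head_b.weight.grad is None)  # True: autograd never touched it
```

As the traceback shows, ZeRO stage 1/2 loops over every parameter in its fp16_groups when reducing, so the idle head's None gradient trips the assertion.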
To Reproduce
Steps to reproduce the behavior:
I made a toy code snippet (a sketch of it is below).
Run this code with, e.g., python script.py.
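A minimal sketch of the kind of toy script that reproduces this. It is not the original snippet; the model, the config values, and the even/odd task alternation are illustrative assumptions:

```python
# NOT the original toy snippet: a sketch of a multi-task setup where each
# step trains only one head, so the other head's params keep grad=None.
import torch
import deepspeed


class MultiTaskNet(torch.nn.Module):
    """Two task heads share a trunk; each step trains only one head."""

    def __init__(self):
        super().__init__()
        self.trunk = torch.nn.Linear(8, 8)
        self.head_a = torch.nn.Linear(8, 1)  # trained on even steps
        self.head_b = torch.nn.Linear(8, 1)  # trained on odd steps

    def forward(self, x, task):
        h = self.trunk(x)
        return self.head_a(h) if task == 0 else self.head_b(h)


ds_config = {
    "train_batch_size": 4,
    "fp16": {"enabled": True},          # requires a CUDA device
    "zero_optimization": {"stage": 1},  # same failure reported for stage 1
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model = MultiTaskNet()
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(4):
    x = torch.randn(4, 8, device=engine.device, dtype=torch.half)
    loss = engine(x, task=step % 2).mean()
    engine.backward(loss)  # only one head's params get a .grad this step
    engine.step()          # ZeRO reduces all fp16 params; the idle head's
                           # params still have grad=None -> AssertionError
```

Depending on the DeepSpeed version, the script may need to be launched with the deepspeed launcher (deepspeed script.py) instead of plain python so that the distributed environment is set up.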