microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Inconsistent results across two identical runs with zero2 + overlap_comm #5523

Open Suparjie opened 6 months ago

Suparjie commented 6 months ago

Describe the bug

When I train a model (Qwen-14B) with DeepSpeed, the results of two identical runs are significantly inconsistent.

First run:

```
{'loss': 2.0, 'grad_norm': 20.78711275277921, 'learning_rate': 9.330127018922195e-06, 'epoch': 0.17}
{'loss': 1.5547, 'grad_norm': 59.310739679010915, 'learning_rate': 7.500000000000001e-06, 'epoch': 0.33}
{'loss': 0.5781, 'grad_norm': 15.005973828390784, 'learning_rate': 5e-06, 'epoch': 0.5}
{'loss': 0.3184, 'grad_norm': 9.697505381713714, 'learning_rate': 2.5000000000000015e-06, 'epoch': 0.67}
{'loss': 0.1318, 'grad_norm': 6.17889934461755, 'learning_rate': 6.698729810778065e-07, 'epoch': 0.83}
{'loss': 0.0859, 'grad_norm': 4.75770403827632, 'learning_rate': 0.0, 'epoch': 1.0}
```

Second run:

```
{'loss': 2.0, 'grad_norm': 20.78707942800991, 'learning_rate': 9.330127018922195e-06, 'epoch': 0.17}
{'loss': 1.5547, 'grad_norm': 43.96039229387614, 'learning_rate': 7.500000000000001e-06, 'epoch': 0.33}
{'loss': 0.6484, 'grad_norm': 12.841190229236128, 'learning_rate': 5e-06, 'epoch': 0.5}
{'loss': 0.3105, 'grad_norm': 9.612021710541004, 'learning_rate': 2.5000000000000015e-06, 'epoch': 0.67}
{'loss': 0.1172, 'grad_norm': 5.885649690333212, 'learning_rate': 6.698729810778065e-07, 'epoch': 0.83}
{'loss': 0.0713, 'grad_norm': 4.291706145544393, 'learning_rate': 0.0, 'epoch': 1.0}
```

We can see that grad_norm already differs at step 2, and the loss differs at step 3.

If I add a synchronization at the line below, the results stay consistent (I've tested it three times): https://github.com/microsoft/DeepSpeed/blob/462def451e06f10f12916c49f34800d52ab52446/deepspeed/runtime/zero/stage_1_and_2.py#L965

```python
def average_tensor(self, tensor):
    if self.overlap_comm:
        stream = self.reduction_stream
        if not get_accelerator().is_synchronized_device():
            stream.wait_stream(get_accelerator().current_stream())
            stream.synchronize()  # added: force the reduction stream to drain before enqueuing this reduce
    else:
        stream = get_accelerator().current_stream()

    with get_accelerator().stream(stream):
        if not self.reduce_scatter:
            self.gradient_reduction_w_predivide(tensor)
            return
```

With this change, the results remain consistent across three runs:

```
{'loss': 2.0, 'grad_norm': 20.78732614100428, 'learning_rate': 9.330127018922195e-06, 'epoch': 0.17}
{'loss': 1.5469, 'grad_norm': 16.731201175437484, 'learning_rate': 7.500000000000001e-06, 'epoch': 0.33}
{'loss': 0.5586, 'grad_norm': 14.621543271989035, 'learning_rate': 5e-06, 'epoch': 0.5}
{'loss': 0.3066, 'grad_norm': 9.533331203714019, 'learning_rate': 2.5000000000000015e-06, 'epoch': 0.67}
{'loss': 0.1226, 'grad_norm': 5.927102870076524, 'learning_rate': 6.698729810778065e-07, 'epoch': 0.83}
{'loss': 0.0796, 'grad_norm': 4.49918771613179, 'learning_rate': 0.0, 'epoch': 1.0}
```

Analysis:

Is the double-buffer and overlap mechanism actually correct? Consider an extreme case: the reduction_stream runs slowly while the current_stream runs fast. If the reduction_stream has not finished when the next fill of self.ipg_buffer begins, will the data in self.ipg_buffer be overwritten? If it is overwritten, the reduced gradients will be incorrect. With the added synchronization, the self.ipg_buffer fills and the reduction_stream communication are interleaved safely, so the buffer cannot be overwritten. A sketch of the suspected hazard follows below.
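To make the concern concrete, here is a minimal stand-alone sketch (plain PyTorch CUDA streams, not DeepSpeed code; `buf` stands in for self.ipg_buffer and the add loop stands in for the all-reduce running on the reduction stream) of a compute stream overwriting a buffer that a slower side stream is still reading, and of how waiting on that stream removes the hazard:

```python
# Minimal illustration of the suspected hazard; names and sizes are illustrative only.
import torch

def demo(serialize: bool) -> float:
    compute = torch.cuda.current_stream()
    reduction = torch.cuda.Stream()

    buf = torch.ones(1 << 22, device="cuda")   # gradient-bucket stand-in
    acc = torch.zeros_like(buf)

    # Slow "reduction" on the side stream that keeps reading buf.
    with torch.cuda.stream(reduction):
        reduction.wait_stream(compute)          # the reduce must see the finished fill
        for _ in range(300):
            acc += buf

    # Without this, the compute stream may overwrite buf while the side
    # stream is still reading it, so the value of acc depends on timing.
    if serialize:
        compute.wait_stream(reduction)          # comparable in effect to the extra sync above

    buf.fill_(-1.0)                             # the next bucket reuses the buffer
    torch.cuda.synchronize()
    return acc.sum().item()

if __name__ == "__main__":
    if torch.cuda.is_available():
        print("racy   :", demo(serialize=False))   # may vary from run to run
        print("ordered:", demo(serialize=True))    # deterministic: 300 per element
    else:
        print("CUDA device required for this sketch")
```

Whether the racy variant actually changes between runs depends on kernel timing, which is exactly the kind of nondeterminism seen in the logs above; the ordered variant is deterministic.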

To Reproduce

Model: Qwen-14B (https://huggingface.co/Qwen/Qwen-14B)

DeepSpeed: 0.10.2

DeepSpeed config:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "bf16": { "enabled": "auto" },
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 1e9,
    "overlap_comm": true,
    "reduce_scatter": false,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true
  }
}
```
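A hypothetical minimal launch sketch (not the reporter's training script): the "auto" fields above are normally resolved by the Hugging Face Trainer integration, so here they are filled in by hand and a toy model stands in for Qwen-14B. The config can be passed to deepspeed.initialize as a dict, or saved to a JSON file and referenced by path.

```python
# Hypothetical repro sketch; run under the DeepSpeed launcher, e.g. `deepspeed repro.py`.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # "auto" resolved by hand for this sketch
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": True,
        "allgather_bucket_size": 1e9,
        "overlap_comm": True,               # the setting under discussion
        "reduce_scatter": False,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": True,
    },
}

# Toy model and optimizer standing in for the real Qwen-14B fine-tuning setup.
model = torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.Linear(4096, 1024))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
)
```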

GuanhuaWang commented 6 months ago

Hi @Suparjie, thanks for raising this issue. I believe the reduction stream and the default compute stream are not synchronized properly.

With your stream.wait_stream(get_accelerator().current_stream()) change above, I am wondering whether the iteration time increased significantly?

Suparjie commented 6 months ago

In that case, the iteration time increased by only about 5%, because computation and communication still overlap.

Suparjie commented 6 months ago

Will you fix the bug?

GuanhuaWang commented 6 months ago

Sorry, I don't think this is the correct fix. Forcing stream.synchronize() means the corresponding NCCL call becomes a blocking call and will not overlap with subsequent compute.

Suparjie commented 6 months ago

I don't think so. Even with a forced stream.synchronize(), there is still overlap because we have two buffers: the backward pass can overlap with the communication. A rough sketch of the pattern I mean is below.
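A rough stand-alone sketch (plain PyTorch streams; names are illustrative, not DeepSpeed internals): the host only drains the previous bucket's reduce before launching the next one, so the reduce of bucket k still overlaps with the compute that fills bucket k+1.

```python
# Rough sketch, not DeepSpeed code: two alternating buckets, one reduction stream.
import torch

def fake_reduce(bucket, reduction_stream, results):
    # Long read-only pass over `bucket`, standing in for an async all-reduce.
    with torch.cuda.stream(reduction_stream):
        acc = torch.zeros_like(bucket)
        for _ in range(200):
            acc += bucket
        results.append(acc)

def double_buffered_loop(num_buckets: int = 6):
    compute = torch.cuda.current_stream()
    reduction = torch.cuda.Stream()
    buffers = [torch.empty(1 << 22, device="cuda") for _ in range(2)]  # the two ipg-style buffers
    results = []
    for k in range(num_buckets):
        buf = buffers[k % 2]
        buf.normal_()                          # "backward" fills this bucket on the compute stream
        reduction.wait_stream(compute)         # the reduce must see the finished fill
        reduction.synchronize()                # host drains the *previous* reduce before enqueuing
        fake_reduce(buf, reduction, results)   # async; overlaps with the next iteration's fill
    torch.cuda.synchronize()
    return results

if __name__ == "__main__":
    if torch.cuda.is_available():
        double_buffered_loop()
```

The cost is the host stall at each bucket boundary, which matches the roughly 5% slowdown mentioned above.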