microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] DeepSpeed Zero Inference (stage 3) Stuck When One Process Doesn't Execute `model.generate()` #4910

Open samuel21119 opened 9 months ago

samuel21119 commented 9 months ago

Describe the bug I attempted to perform inference on the LLaMA2 70B model using DeepSpeed with Zero optimization (stage 3) across multiple GPUs (NVIDIA V100). While the initial setup appeared successful, I encountered a critical issue: if any process within the distributed environment does not execute the model.generate() function, it leads to a deadlock situation, causing processes on other GPUs to remain stuck indefinitely.

To Reproduce Steps to reproduce the behavior:

  1. Modify the script in DeepSpeedExamples to run DeepSpeed ZeRO (stage 3) inference on multiple GPUs.
  2. Add an infinite loop to a specific rank before its call to `model.generate()` (see the sketch after this list).
  3. Alternatively, the bug also appears when one process finishes all of its prompts while the others are still generating, and the finished process simply stops calling `model.generate()`. Even with a barrier that keeps the finished process from tearing down its model, the hang still occurs.
  4. Use DeepSpeed v0.12.6 and Torch v2.0.1+cu117.
  5. Execute the script with `deepspeed --num_gpus 8 ...`
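
For concreteness, the modification boils down to roughly the following. This is a minimal sketch, not the actual DeepSpeedExamples code; the model name, prompt, and config values here are assumptions.

```python
# Rough sketch of the repro (illustrative names/values, not the exact script).
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig  # transformers.deepspeed in older versions

local_rank = int(os.getenv("LOCAL_RANK", "0"))

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,  # required key; unused for inference
}

model_name = "meta-llama/Llama-2-70b-hf"  # assumed checkpoint
dschf = HfDeepSpeedConfig(ds_config)      # keep alive so from_pretrained shards weights under ZeRO-3
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(local_rank)

# Step 2 of the repro: rank 0 never reaches generate().
if local_rank == 0:
    while True:
        pass

# The remaining ranks hang here: each decode step all-gathers ZeRO-3 parameter
# partitions, and rank 0 never joins those collectives.
with torch.no_grad():
    engine.module.generate(**inputs, max_new_tokens=32)
```

Launched as in step 5, every rank except rank 0 hangs inside `generate()`.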

Expected behavior The inference should not be stuck.

My understanding is that ZeRO optimization stage 3 lets multiple GPUs share the model weights, using faster P2P links such as NVLink so that the same data does not have to be transferred repeatedly from CPU to GPU over PCIe.

Is there any solution or workaround that allows all GPUs to continue working, even if not all processes are executing the function model.generate()? This would help maintain parallelism and prevent the deadlock situation currently observed.

ds_config used

```json
{
    "fp16": {
        "enabled": true
    },
    "bf16": {
        "enabled": false
    },
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 2.684355e+08,
        "stage3_param_persistence_threshold": 8.192000e+03,
        "stage3_max_live_parameters": 2.684355e+08,
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        }
    },
    "steps_per_print": 2.000000e+03,
    "train_batch_size": 16,
    "wall_clock_breakdown": false
}
```

Screenshots In the screenshot, rank 0 (on GPU0) is held in an infinite loop and never calls `model.generate()`; the other processes are stuck as well, with no fluctuation in GPU power usage.


System info:

tjruwase commented 9 months ago

@samuel21119, thanks for reporting this issue. Unfortunately, there is no easy solution here. This is because ZeRO-* is fundamentally a data parallel algorithm. And so, all ranks in the data parallel group are expected to perform similar computation/communication, otherwise you will see the deadlocks you are observing.
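
For illustration only (not code from this thread, and not a DeepSpeed API): one pattern that satisfies the "all ranks perform similar computation/communication" constraint is to keep every rank calling `generate()` in lockstep, with ranks that have run out of prompts feeding a dummy prompt until all ranks agree they are finished. The helper name and its arguments below are assumptions.

```python
# Sketch: keep ZeRO-3 ranks in lockstep when prompt counts are uneven.
import torch
import torch.distributed as dist

def generate_in_lockstep(engine, tokenizer, my_prompts, device, max_new_tokens=32):
    outputs = []
    dummy = tokenizer("hello", return_tensors="pt").to(device)
    i = 0
    while True:
        have_work = i < len(my_prompts)
        # Every rank votes; the loop only exits once *no* rank has work left,
        # so all ranks run the same number of iterations.
        flag = torch.tensor([1 if have_work else 0], device=device)
        dist.all_reduce(flag, op=dist.ReduceOp.SUM)
        if flag.item() == 0:
            break
        if have_work:
            inputs = tokenizer(my_prompts[i], return_tensors="pt").to(device)
            i += 1
        else:
            inputs = dummy  # finished ranks still join every collective
        with torch.no_grad():
            out = engine.module.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                min_new_tokens=max_new_tokens,  # same number of decode steps on every rank
            )
        if have_work:
            outputs.append(out)
    return outputs
```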

samuel21119 commented 9 months ago

> @samuel21119, thanks for reporting this issue. Unfortunately, there is no easy solution here. This is because ZeRO-* is fundamentally a data parallel algorithm. And so, all ranks in the data parallel group are expected to perform similar computation/communication, otherwise you will see the deadlocks you are observing.

@tjruwase, thank you for your reply. To prevent this issue, is it necessary to set `max_new_tokens` uniformly across all GPUs to align with the ZeRO-* requirements?
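
For reference, pinning `min_new_tokens` to the same value as `max_new_tokens` on every rank should force an identical number of decode steps per call, which removes one source of divergence (though it does not by itself guarantee that all ranks make the same number of `generate()` calls). A minimal way to keep the settings uniform, assuming `torch.distributed` is already initialized and with illustrative values:

```python
# Sketch: share one set of generation settings across ranks by broadcasting
# from rank 0, so no rank can drift to different settings.
import torch.distributed as dist

gen_kwargs = {"max_new_tokens": 128, "min_new_tokens": 128, "do_sample": False}

obj = [gen_kwargs if dist.get_rank() == 0 else None]
dist.broadcast_object_list(obj, src=0)
gen_kwargs = obj[0]

# output = engine.module.generate(**inputs, **gen_kwargs)
```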