Open samuel21119 opened 9 months ago
@samuel21119, thanks for reporting this issue. Unfortunately, there is no easy solution here. This is because ZeRO-* is fundamentally a data parallel algorithm. And so, all ranks in the data parallel group are expected to perform similar computation/communication, otherwise you will see the deadlocks you are observing.
@tjruwase, thank you for your reply.
To prevent this issue, is it necessary to set `max_new_tokens` uniformly for all GPUs to align with ZeRO-* requirements?
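One way to act on the constraint described in the reply is to make every rank call `model.generate()` the same number of times, letting ranks that run out of real prompts feed a throwaway prompt and discard the result. The following is only a minimal sketch of that idea, not a DeepSpeed API: `pad_prompts`, `run_rank`, `generate_fn`, and the `"hello"` dummy prompt are hypothetical names chosen for illustration, and `global_max` is assumed to have been agreed on by all ranks beforehand (for example via a `torch.distributed.all_reduce` with `ReduceOp.MAX`). As far as I know, Hugging Face `generate()` also accepts `synced_gpus=True`, which is meant to keep a rank stepping until all ranks have finished decoding, so differing output lengths alone should not require identical `max_new_tokens`.

```python
from typing import Callable, List, Sequence, Tuple


def pad_prompts(
    local_prompts: Sequence[str], global_max: int, dummy: str
) -> List[Tuple[str, bool]]:
    """Return (prompt, is_real) pairs, padded with dummy prompts up to global_max."""
    if global_max < len(local_prompts):
        raise ValueError("global_max must be >= the number of local prompts")
    padded = [(p, True) for p in local_prompts]
    padded += [(dummy, False)] * (global_max - len(local_prompts))
    return padded


def run_rank(
    local_prompts: Sequence[str],
    global_max: int,
    generate_fn: Callable[[str], str],
    dummy: str = "hello",  # hypothetical throwaway prompt
) -> List[str]:
    """Call generate_fn exactly global_max times, keeping only real outputs.

    Assumption: in a real run, generate_fn would wrap something like
    model.generate(..., synced_gpus=True), and global_max would be the
    maximum prompt count over all ranks.
    """
    outputs = []
    for prompt, is_real in pad_prompts(local_prompts, global_max, dummy):
        result = generate_fn(prompt)  # every rank reaches this call each iteration
        if is_real:
            outputs.append(result)  # results for dummy prompts are dropped
    return outputs
```

With this shape, a rank that has no work still participates in every collective triggered inside `generate_fn`, at the cost of some wasted compute on the dummy prompts.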
Describe the bug
I attempted to run inference on the LLaMA2 70B model using DeepSpeed with ZeRO stage 3 across multiple GPUs (NVIDIA V100). While the initial setup appeared successful, I encountered a critical issue: if any process in the distributed environment does not execute the `model.generate()` function, a deadlock occurs and the processes on the other GPUs remain stuck indefinitely.

To Reproduce
Steps to reproduce the behavior:
1. Have at least one process skip the `model.generate()` function
2. Run `deepspeed --num_gpus 8 ...`
Expected behavior
The inference should not get stuck.
I understand that ZeRO stage 3 lets multiple GPUs share the model weights over faster P2P links such as NVLink, which avoids repeatedly transferring the same data from CPU to GPU via PCIe.
Is there any solution or workaround that allows all GPUs to continue working, even if not all processes are executing the `model.generate()` function? This would help maintain parallelism and prevent the deadlock situation currently observed.

ds_report output
Screenshots
In the provided screenshot, rank 0 (using GPU0) sits in an infinite loop instead of executing the `model.generate()` function. The other processes are stuck as well, with no fluctuation in power usage.
System info: