Coobiw opened this issue 5 months ago
@Coobiw, you can use the GatheredParameters context manager, which will automatically gather the parameters within the context and release them on exit. You can see a simple example usage of computing a moving average of parameters here.
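A minimal sketch of the suggested pattern (not from the original comment; `model_engine` and `input_ids` are assumed placeholder names for a ZeRO stage-3 DeepSpeed engine and a tokenized prompt batch):

```python
import deepspeed

# model_engine: the engine returned by deepspeed.initialize() under a
# ZeRO stage-3 config; input_ids: a prompt batch on this rank.
params = list(model_engine.module.parameters())

# Inside the context every rank holds the full weights, so generate()
# works as on a single GPU. On exit the parameters are re-partitioned
# automatically, so the next training epoch needs no extra handling.
# Caveat: the full unsharded model must fit in each GPU's memory.
with deepspeed.zero.GatheredParameters(params):
    model_engine.module.eval()
    outputs = model_engine.module.generate(input_ids, max_new_tokens=64)
```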
Hi, I've tried this before, but the program hangs. How can I debug this?
Also, I want to know whether this is because I'm using a 30B+ LLM and ZeRO-3 inference is very slow?
```python
import deepspeed
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus

if self.zero_stage == 3:
    # Only gather parameters that ZeRO-3 has partitioned away from this rank.
    params_to_fetch = [
        p for p in self.model.parameters()
        if hasattr(p, 'ds_id') and p.ds_status == ZeroParamStatus.NOT_AVAILABLE
    ]
    should_gather_param = len(params_to_fetch) > 0
    with deepspeed.zero.GatheredParameters(params_to_fetch,
                                           enabled=should_gather_param):
        self.model.eval()
        evaluation()  # contains model.generate()
```
@Coobiw, can you share your full script to help us repro on our side?
Is this a dense or MoE model?
In terms of debugging, can you use prints to pinpoint the hang point?
Also, can you try to repro on a single GPU so that you can use pdb for debugging? You can try two options for this:
Sorry, it is inconvenient to share the whole code, but I will try my best to provide more information. It is a dense model. I've tried the script with my ~9B model on A100 80GB, and a similar hang appeared.
I think it may be a multi-GPU communication problem? No explicit error is raised; there is only a warning in model.generate, along with an NCCL warning:
/root/miniconda3/lib/python3.9/site-packages/transformers/generation/configuration_utils.py:497: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
warnings.warn(
t-20240517175036-k966t-worker-0:5136:5282 [7] ib_plugin.c:798 NCCL WARN NET/IB : req 0/1 tag 7 peer 172.25.40.117<36987> collective mismatch error, local size 897024 remote size 614400
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO transport/net.cc:990 -> 5
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO proxy.cc:679 -> 5
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO proxy.cc:858 -> 5 [Proxy Thread]
I guess the collective mismatch error (local size 897024, remote size 614400) causes the hang.
Additionally, my environment is as follows:
deepspeed == 0.14.0
cuda: 11.8
The output of nvcc -V is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
After double-checking, I found another error message on one worker, as follows (probably a timeout error):
[E ProcessGroupNCCL.cpp:475] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96383, OpType=_ALLGATHER_BASE, NumelIn=88200, NumelOut=5644800, Timeout(ms)=7200000) ran for 7200520 milliseconds before timing out.
t-20240517230118-grg2t-worker-1:5123:5271 [0] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5125:5269 [2] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5127:5272 [4] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5129:5270 [6] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5130:5275 [7] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5130:5206 [0] NCCL INFO comm 0x738ea950 rank 15 nranks 64 cudaDev 7 busId e4000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 15] NCCL watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96383, OpType=_ALLGATHER_BASE, NumelIn=88200, NumelOut=5644800, Timeout(ms)=7200000) ran for 7200520 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
Hi, I also tested this on one node (8 x A100) with a 9B model. The hang still appeared. TAT
Another cause of hanging like this is if the prompt length or generation length differs across the GPUs. This is because ZeRO inference is a data-parallel algorithm: every rank must participate in the same sequence of collectives, so a rank that runs more generation steps than its peers issues allgathers that no other rank matches. A common mitigation is sketched below.
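A sketch of that mitigation, assuming a Hugging Face transformers model running under ZeRO-3 (`model`, `input_ids`, and `attention_mask` are placeholder names): transformers' generate accepts a `synced_gpus` flag intended for exactly this situation.

```python
# Prompts are left-padded to the same length on every rank.
# synced_gpus=True keeps ranks that finish early (e.g. on EOS) running
# dummy forward passes until all ranks are done, so the per-step ZeRO-3
# parameter allgathers stay aligned across all data-parallel ranks.
outputs = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=128,
    synced_gpus=True,
)
```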
Oh, thanks, I get it. Do you have any suggestions about this? I think I've already done left-padding. How can I ensure the output length is the same?
@Coobiw, I think we need to first confirm that different prompt/generation lengths are responsible. Can you force all the ranks to process the exact same prompt?
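For example, a debugging sketch along those lines (assuming `tokenizer` and `model` are already set up; the literal prompt is arbitrary):

```python
# Hard-code one prompt so every rank runs an identical batch. If the hang
# disappears, differing prompt/generation lengths across ranks were the cause.
prompt = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Pinning min_new_tokens == max_new_tokens also fixes the number of
# generation steps, ruling out early-EOS divergence between ranks.
outputs = model.generate(**inputs, max_new_tokens=32, min_new_tokens=32)
```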
Describe the bug Hi, I use ZeRO-3 for MLLM training. After a one-epoch training stage, I want to evaluate the model (using model.generate()). However, the parameters of the model are partitioned across multiple GPUs and are not gathered.
If the parameters are not gathered, the forward pass during evaluation (generation) raises an error like:
RuntimeError: The size of tensor a (1152) must match the size of tensor b (0) at non-singleton dimension 2
How can I gather the parameters on every GPU for parallel evaluation (inference/generation), e.g. using deepspeed.zero.GatheredParameters? And after evaluation, how can I shard the model parameters again for the next training epoch? Thanks for your reply!