microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] ZeRO-3: Gather the params for inference (huggingface_language_model.generate) at the end of one epoch and re-partition them for next-epoch training #5539

Open Coobiw opened 5 months ago

Coobiw commented 5 months ago

Describe the bug Hi, I use ZeRO-3 for MLLM training. After one epoch of training, I want to evaluate the model (using model.generate()). However, the model's params are partitioned across multiple GPUs and have not been gathered.

If the params are not gathered, an error is raised during evaluation (generation) because the forward pass contains steps like:

image_embeds += self.pos_embed

RuntimeError: The size of tensor a (1152) must match the size of tensor b (0) at non-singleton dimension 2

How can I gather the params on every GPU for parallelized evaluation (inference/generation), e.g. using deepspeed.zero.GatheredParameters? And after evaluation, how can I shard the model parameters again for the next training epoch?

Thanks for your reply!

tjruwase commented 5 months ago

@Coobiw, you can use the GatheredParameters context manager, which will automatically gather the parameters within the context and release them on exit. You can see a simple example usage, computing a moving average of the parameters, here.
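
For reference, a minimal sketch of that pattern around generation; model_engine, tokenizer, and prompts are placeholder names, not taken from this thread. With modifier_rank=None the gathered copies are read-only and are re-partitioned automatically when the context exits.

# Minimal sketch, assuming a DeepSpeedEngine (`model_engine`), a HF `tokenizer`,
# and a list of `prompts` -- all placeholder names.
import deepspeed

def evaluate(model_engine, tokenizer, prompts):
    module = model_engine.module          # underlying HuggingFace model
    module.eval()
    # Gather the full ZeRO-3 partitioned parameters on every rank for the
    # duration of the context; they are re-partitioned automatically on exit.
    with deepspeed.zero.GatheredParameters(list(module.parameters()), modifier_rank=None):
        inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model_engine.device)
        outputs = module.generate(**inputs, max_new_tokens=64)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

Note that gathering every parameter means the full, unsharded model must fit in each GPU's memory for the duration of generation, which can be the limiting factor for a 30B+ model.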

Coobiw commented 5 months ago

Hi, I've tried this before, but the program hangs. How can I debug this?

I also want to know whether it is because I use a 30B+ LLM and ZeRO-3 inference is very slow?

# Gather any ZeRO-3 parameters that are not currently materialized, then run evaluation.
if self.zero_stage == 3:
    params_to_fetch = [
        p for p in self.model.parameters()
        if hasattr(p, 'ds_id')
        and p.ds_status == deepspeed.zero.partition_parameters.ZeroParamStatus.NOT_AVAILABLE
    ]
    should_gather_param = len(params_to_fetch) > 0
    with deepspeed.zero.GatheredParameters(params_to_fetch, enabled=should_gather_param):
        self.model.eval()
        evaluation()  # contains model.generate()

tjruwase commented 5 months ago

@Coobiw, can you share your full script to help us repro on our side?

Is this a dense or MoE model?

In terms of debugging, can you use prints to pinpoint where it hangs?

Also, can you try to repro on a single GPU so that you can use pdb for debugging? You can try two options for this:

  1. Enable CPU/NVMe offloading to fit the model (see the config sketch after this list), or
  2. Use a smaller model.
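
For option 1, here is a minimal sketch of a ZeRO-3 config with parameter and optimizer offload to CPU, expressed as a Python dict that can be passed to deepspeed.initialize(..., config=ds_config); the batch size and bf16 settings are placeholders to adjust for your setup.

# Sketch only: ZeRO-3 with CPU offload so a large model fits on one GPU for pdb debugging.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
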
Coobiw commented 5 months ago

Sorry, it is inconvenient to share the whole code; I will try my best to provide more information. It is a dense model. I've tried the script with my ~9B model on A100 80GB GPUs, and a similar hang appeared.

I think it may be a multi-GPU communication problem? There is no explicit error, only a warning during model.generate that is related to NCCL:

/root/miniconda3/lib/python3.9/site-packages/transformers/generation/configuration_utils.py:497: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
t-20240517175036-k966t-worker-0:5136:5282 [7] ib_plugin.c:798 NCCL WARN NET/IB : req 0/1 tag 7 peer 172.25.40.117<36987> collective mismatch error, local size 897024 remote size 614400
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO transport/net.cc:990 -> 5
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO proxy.cc:679 -> 5
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO proxy.cc:858 -> 5 [Proxy Thread]

I guess the collective mismatch error, local size 897024 remote size 614400, causes the hang.

Additionally, my environment is as follows:

deepspeed == 0.14.0
cuda: 11.8

The output of nvcc -V is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
Coobiw commented 5 months ago

After double-checking, I found another error message on one worker, as follows (probably a time-out error):

[E ProcessGroupNCCL.cpp:475] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96383, OpType=_ALLGATHER_BASE, NumelIn=88200, NumelOut=5644800, Timeout(ms)=7200000) ran for 7200520 milliseconds before timing out.
t-20240517230118-grg2t-worker-1:5123:5271 [0] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5125:5269 [2] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5127:5272 [4] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5129:5270 [6] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5130:5275 [7] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5130:5206 [0] NCCL INFO comm 0x738ea950 rank 15 nranks 64 cudaDev 7 busId e4000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 15] NCCL watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96383, OpType=_ALLGATHER_BASE, NumelIn=88200, NumelOut=5644800, Timeout(ms)=7200000) ran for 7200520 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
Coobiw commented 5 months ago

Hi, I also tested this on one node (8 x A100) with a 9B model. The hang appeared again. TAT

tjruwase commented 5 months ago

Another cause of hanging like this is if the prompt length or generation length differs across the GPUs. This is because ZeRO-Inference is a data-parallel algorithm.
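
One possible way to keep the collective schedule identical across ranks (a sketch, not an official recipe; FIXED_PROMPT_LEN, FIXED_NEW_TOKENS, prompts, model, and tokenizer are placeholder names) is to pad every prompt to a fixed length and force a fixed number of generated tokens:

# Sketch: identical prompt length and decode-step count on every rank.
FIXED_PROMPT_LEN = 256
FIXED_NEW_TOKENS = 64

tokenizer.padding_side = "left"
inputs = tokenizer(
    prompts,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=FIXED_PROMPT_LEN,
).to("cuda")

outputs = model.generate(
    **inputs,
    min_new_tokens=FIXED_NEW_TOKENS,   # suppress early EOS so every rank
    max_new_tokens=FIXED_NEW_TOKENS,   # runs the same number of decode steps
)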

Coobiw commented 5 months ago

Oh, thanks, I get it. Do you have any suggestions about this? I think I've already done left-padding. How can I ensure the output length is the same?

tjruwase commented 5 months ago

@Coobiw, I think we need to first confirm that different prompt/generation lengths are responsible. Can you force all the ranks to process the exact same prompt?
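
For example, a sketch of that sanity check with placeholder names (a real MLLM run would also need the identical image on every rank): if the hang disappears with an identical prompt everywhere, uneven prompt/generation lengths are the likely cause.

# Sketch of the sanity check: every rank processes one identical, hard-coded prompt.
import torch.distributed as dist

probe_prompt = "Describe the picture in one sentence."   # same literal on all ranks
inputs = tokenizer([probe_prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)

if dist.get_rank() == 0:
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))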