huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

FSDP with PPO trainer won't work because FSDP doesn't support model.generate #1726

Closed artkpv closed 2 months ago

artkpv commented 3 months ago

Hi! This is part feature request, part bug report. How can I fine-tune models with PPO when they don't fit on one GPU? I'm using FSDP from torch.distributed with Accelerate. Unfortunately, ppo_trainer.py and ppov2_trainer.py use model.generate(..) (GenerationMixin from transformers). When I try to generate tokens it fails with "RuntimeError: 'weight' must be 2-d" (see this). It looks like TRL currently doesn't support FSDP because of this dependency on model.generate, which FSDP doesn't support.
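Roughly, the failing call pattern looks like the sketch below (simplified, not the exact training script; the model name and PPO sizes are placeholders, and it assumes the script is started with `accelerate launch` using an FSDP config):

```python
# Simplified sketch of the failing pattern, not the exact train.py.
# Assumes an `accelerate launch` with an FSDP config; model name and PPO
# batch sizes are placeholders.
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "meta-llama/Meta-Llama-3-8B"  # placeholder
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(
    PPOConfig(batch_size=1, mini_batch_size=1), model, ref_model=None, tokenizer=tokenizer
)

query = tokenizer("Hello", return_tensors="pt").input_ids.to(ppo_trainer.accelerator.device)
# Under FSDP the embedding weight reaches F.embedding as a flattened 1-D shard,
# so this raises: RuntimeError: 'weight' must be 2-d
response = ppo_trainer.model.generate(query, max_new_tokens=16)
```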

The error:

> Traceback (most recent call last):
>   File "/data/artyom_karpov/rl4steg/train.py", line 385, in <module>
>     train(context)
>   File "/data/artyom_karpov/rl4steg/train.py", line 308, in train
>     res = ppo_trainer.model.generate(
>           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/trl/models/modeling_value_head.py", line 204, in generate
>     return self.pretrained_model.generate(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/peft/peft_model.py", line 1491, in generate
>     outputs = self.base_model.generate(*args, **kwargs)
>               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
>     return func(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/transformers/generation/utils.py", line 1758, in generate
>     result = self._sample(
>              ^^^^^^^^^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/transformers/generation/utils.py", line 2397, in _sample
>     outputs = self(
>               ^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
>     return self._call_impl(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
>     return forward_call(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1166, in forward
>     outputs = self.model(
>               ^^^^^^^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
>     return self._call_impl(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
>     return forward_call(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 926, in forward
>     inputs_embeds = self.embed_tokens(input_ids)
>                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
>     return self._call_impl(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
>     return forward_call(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 164, in forward
>     return F.embedding(
>            ^^^^^^^^^^^^
>   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/nn/functional.py", line 2267, in embedding
>     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> RuntimeError: 'weight' must be 2-d

Notes from stepping through the failure with a debugger:

- modeling_llama.py:925: the failing module is LlamaModel((embed_tokens): Embedding(128256, 8192)), i.e. 128256 * 8192 = 1,050,673,152 weight elements.
- sparse.py:163 / functional.py:2267: weight.shape=torch.Size([525340673]), input.shape=torch.Size([1, 9]), padding_idx=-1, scale_grad_by_freq=False, sparse=False. The embedding weight reaches F.embedding as a flattened 1-D tensor instead of a 2-D matrix, hence the error.

My setup:

Python 3.11.5
accelerate 0.30.1
torch 2.4.0.dev20240515+cu121
torchaudio 2.2.0.dev20240515+cu121
torchvision 0.19.0.dev20240515+cu121
transformers 4.41.2

Thanks for your attention

vwxyzjn commented 3 months ago

Thanks for reporting. This is a known issue. Unfortunately, FSDP is not currently supported for PPO, but we are considering it for future iterations. For now, please use DeepSpeed stage 2 or 3.
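If you want to experiment with FSDP anyway, one workaround that is sometimes tried is to gather the full, unflattened parameters around generation with FSDP.summon_full_params. A rough, untested sketch, assuming `fsdp_model` is the FSDP-wrapped policy returned by `accelerator.prepare`:

```python
# Untested sketch, not an officially supported path: materialize the full 2-D
# parameters for the duration of generation so F.embedding sees a proper matrix.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def generate_with_full_params(fsdp_model, accelerator, input_ids, **gen_kwargs):
    unwrapped = accelerator.unwrap_model(fsdp_model)  # the underlying HF model
    with FSDP.summon_full_params(fsdp_model, writeback=False):
        with torch.no_grad():
            return unwrapped.generate(input_ids=input_ids, **gen_kwargs)
```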

artkpv commented 3 months ago

> Thanks for reporting. This is a known issue. Unfortunately, FSDP is not currently supported for PPO, but we are considering it for future iterations. For now, please use DeepSpeed stage 2 or 3.

Thanks for the reply, @vwxyzjn. Yeah, I see it fails in several places. Now there is this error:

2024-06-17 08:37:15,707::93731__main__:DEBUG Before PPO step
[rank1]:[E617 08:47:15.010046090 ProcessGroupNCCL.cpp:572] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7930, OpType=_ALLGATHER_BASE, NumelIn=214016000, NumelOut=856064000, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
[rank1]:[E617 08:47:15.011359495 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 7930, last enqueued NCCL work: 7931, last completed NCCL work: 7929.
[rank2]:[E617 08:47:15.047646334 ProcessGroupNCCL.cpp:572] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7930, OpType=_ALLGATHER_BASE, NumelIn=214016000, NumelOut=856064000, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
[rank2]:[E617 08:47:15.048060595 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 7930, last enqueued NCCL work: 7931, last completed NCCL work: 7929.
[rank0]:[E617 08:47:15.098972015 ProcessGroupNCCL.cpp:572] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7930, OpType=_ALLGATHER_BASE, NumelIn=214016000, NumelOut=856064000, Timeout(ms)=600000) ran for 600070 milliseconds before timing out.
[rank0]:[E617 08:47:15.099381586 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 7930, last enqueued NCCL work: 7931, last completed NCCL work: 7929.
[rank3]:[E617 08:47:15.617045435 ProcessGroupNCCL.cpp:572] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7930, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600043 milliseconds before timing out.
[rank3]:[E617 08:47:15.617461256 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 7930, last enqueued NCCL work: 7930, last completed NCCL work: 7929.
[rank1]:[E617 08:47:15.836224160 ProcessGroupNCCL.cpp:1632] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 7930, last enqueued NCCL work: 7931, last completed NCCL work: 7929.
[rank2]:[E617 08:47:15.836226690 ProcessGroupNCCL.cpp:1632] [PG 0 (default_pg) Rank 2] Timeout at NCCL work: 7930, last enqueued NCCL work: 7931, last completed NCCL work: 7929.
[rank1]:[E617 08:47:15.836557209 ProcessGroupNCCL.cpp:586] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E617 08:47:15.836835376 ProcessGroupNCCL.cpp:586] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E617 08:47:15.837097713 ProcessGroupNCCL.cpp:592] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E617 08:47:15.837359220 ProcessGroupNCCL.cpp:592] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E617 08:47:15.838567322 ProcessGroupNCCL.cpp:1632] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 7930, last enqueued NCCL work: 7931, last completed NCCL work: 7929.
[rank1]:[E617 08:47:15.838647484 ProcessGroupNCCL.cpp:1448] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7930, OpType=_ALLGATHER_BASE, NumelIn=214016000, NumelOut=856064000, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f710bfecde6 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f710d2938f2 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7f710d299f67 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f710d29bd6c in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f7159f43bf4 in /data/artyom_karpov/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7f716211fea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f716173fb2d in /lib64/libc.so.6)

[rank0]:[E617 08:47:15.838852370 ProcessGroupNCCL.cpp:586] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E617 08:47:15.838951683 ProcessGroupNCCL.cpp:1448] [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7930, OpType=_ALLGATHER_BASE, NumelIn=214016000, NumelOut=856064000, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7efc19573de6 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7efc1a81a8f2 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7efc1a820f67 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7efc1a822d6c in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7efc674cabf4 in /data/artyom_karpov/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7efc6f6a6ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7efc6ecc6b2d in /lib64/libc.so.6)

[rank0]:[E617 08:47:15.839506898 ProcessGroupNCCL.cpp:592] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E617 08:47:15.841090820 ProcessGroupNCCL.cpp:1448] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7930, OpType=_ALLGATHER_BASE, NumelIn=214016000, NumelOut=856064000, Timeout(ms)=600000) ran for 600070 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f4573079de6 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f45743208f2 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7f4574326f67 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4574328d6c in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f45c0fd0bf4 in /data/artyom_karpov/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7f45c91acea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f45c87ccb2d in /lib64/libc.so.6)

[rank3]:[E617 08:47:16.241666661 ProcessGroupNCCL.cpp:1632] [PG 0 (default_pg) Rank 3] Timeout at NCCL work: 7930, last enqueued NCCL work: 7930, last completed NCCL work: 7929.
[rank3]:[E617 08:47:16.242020870 ProcessGroupNCCL.cpp:586] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E617 08:47:16.242297407 ProcessGroupNCCL.cpp:592] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E617 08:47:16.243859509 ProcessGroupNCCL.cpp:1448] [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7930, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600043 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f22a993dde6 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f22aabe48f2 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7f22aabeaf67 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f22aabecd6c in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f22f7894bf4 in /data/artyom_karpov/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7f22ffa70ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f22ff090b2d in /lib64/libc.so.6)

This happens after I call ppo_trainer.step.
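One mitigation worth noting, under the assumption that the step is merely slow rather than deadlocked, is to raise the NCCL collective timeout above the 600000 ms shown in the log. PPOTrainer builds its own Accelerator internally, so in practice this has to reach whatever code constructs it; a bare Accelerator is shown below only as a sketch:

```python
# Sketch of raising the NCCL collective timeout; an assumption about mitigation,
# not a fix for a rank that is genuinely hung.
from datetime import timedelta
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=2))]
)
```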

vwxyzjn commented 3 months ago

Is this with DS3? Also, would you like to try out the PPOv2Trainer?

artkpv commented 3 months ago

> Is this with DS3? Also, would you like to try out the PPOv2Trainer?

@vwxyzjn Sorry for the half-baked bug report. See it at https://github.com/huggingface/accelerate/issues/2868. I thought about using PPOv2Trainer, but it doesn't suit my case: I run inference twice with my policy and use both results for the assessment, i.e. res1 = policy(x), res2 = policy(res1, x), and then I compute the reward with another model from those two results. As far as I can see, PPOv2Trainer allows only a single generation pass.
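For concreteness, the two-pass rollout looks roughly like the sketch below, written against the v1 PPOTrainer.generate API; `external_reward_fn` and `gen_kwargs` are placeholders, not real project code:

```python
# Sketch of the two-pass rollout: the policy is queried twice and an external
# model scores the pair. `external_reward_fn` is a hypothetical helper.
import torch

def two_pass_rollout(ppo_trainer, tokenizer, query_text, external_reward_fn, gen_kwargs):
    device = ppo_trainer.accelerator.device
    query = tokenizer(query_text, return_tensors="pt").input_ids[0].to(device)

    # res1 = policy(x)
    res1 = ppo_trainer.generate(query, return_prompt=False, **gen_kwargs)[0]

    # res2 = policy(res1, x)
    res2 = ppo_trainer.generate(torch.cat([query, res1]), return_prompt=False, **gen_kwargs)[0]

    # reward comes from a separate model that sees both results
    reward = torch.tensor(
        external_reward_fn(tokenizer.decode(res1), tokenizer.decode(res2)), device=device
    )
    return query, res1, res2, reward
```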

artkpv commented 3 months ago

@vwxyzjn, and to anyone else who can help,

So, an update on this issue: I've tried DeepSpeed stage 3, and now I see timeouts on NCCL operations. I haven't found exactly which operation times out. Tracing/debugging so far shows it reaches line 713 in ppo_trainer.py, `model_inputs["input_ids"] = self.accelerator.pad_across_processes(...`. At that point I previously saw timeouts when it tried to allocate 1 EB (see above) on CUDA and got an OOM. Now it seems to get past that somehow but hangs somewhere else. All this is painful and takes a long time. As I wrote, I can't use PPO trainer v2. I have many options and ideas to investigate. Thank you for any thoughts on how I can fix this.
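For reference, the call at that line follows the general pattern sketched below (an illustration, not the exact trl code). Since pad_across_processes is a collective, every rank has to reach it; if one rank is stuck elsewhere, the others wait there until the NCCL watchdog timeout fires.

```python
# Illustration of the pattern around ppo_trainer.py:713, not the exact trl code:
# padding input_ids across ranks is a collective op, so every rank must reach it.
def pad_inputs_across_ranks(accelerator, model_inputs, pad_token_id):
    model_inputs["input_ids"] = accelerator.pad_across_processes(
        model_inputs["input_ids"], dim=1, pad_index=pad_token_id
    )
    return model_inputs
```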

Here is the log from the last run. It fails after 20 min of waiting:

2024-06-20 11:41:34,839:260555:__main__:DEBUG Tensors, 3: q.shape=torch.Size([278]) q.device=device(type='cuda', index=2) response.shape=torch.Size([130]) response.device=device(type='cuda', index=2)reward=tensor(0.1001, device='cuda:2', dtype=torch.bfloat16)
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=775693, OpType=_ALLGATHER_BASE, NumelIn=262668288, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=775693, OpType=_ALLGATHER_BASE, NumelIn=262668288, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800044 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=775693, OpType=_ALLGATHER_BASE, NumelIn=262668288, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800058 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 775693, last enqueued NCCL work: 775693, last completed NCCL work: 775692.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=775693, OpType=_ALLGATHER_BASE, NumelIn=262668288, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ffadf608897 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7ffae08e1c62 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ffae08e6a80 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ffae08e7dcc in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7ffb2c39fbf4 in /data/artyom_karpov/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7ffb3457bea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7ffb33b9bb2d in /lib64/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=775693, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800022 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 775693, last enqueued NCCL work: 775693, last completed NCCL work: 775692.
[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=775693, OpType=_ALLGATHER_BASE, NumelIn=262668288, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800058 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f34f7a64897 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f34f8d3dc62 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f34f8d42a80 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f34f8d43dcc in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f35447fbbf4 in /data/artyom_karpov/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7f354c9d7ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f354bff7b2d in /lib64/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 775693, last enqueued NCCL work: 775693, last completed NCCL work: 775692.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=775693, OpType=_ALLGATHER_BASE, NumelIn=262668288, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800044 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6bc5247897 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f6bc6520c62 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6bc6525a80 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6bc6526dcc in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f6c11fdebf4 in /data/artyom_karpov/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7f6c1a1baea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f6c197dab2d in /lib64/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 775693, last enqueued NCCL work: 775693, last completed NCCL work: 775692.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=775693, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800022 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f91617897 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f2f928f0c62 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2f928f5a80 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2f928f6dcc in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f2fde3aebf4 in /data/artyom_karpov/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7f2fe658aea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f2fe5baab2d in /lib64/libc.so.6)

  0%|          | 0/200 [42:10<?, ?it/s]
W0620 12:11:55.877000 140572652951360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 260553 closing signal SIGTERM
W0620 12:11:55.877000 140572652951360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 260554 closing signal SIGTERM
W0620 12:11:55.878000 140572652951360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 260555 closing signal SIGTERM
E0620 12:11:56.592000 140572652951360 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 3 (pid: 260556) of binary: /data/artyom_karpov/rl4steg/.venv/bin/python
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
Traceback (most recent call last):
  File "/data/artyom_karpov/rl4steg/.venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
    deepspeed_launcher(args)
  File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/commands/launch.py", line 786, in deepspeed_launcher
    distrib_run.run(args)
  File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-20_12:11:55
  host      : compute-permanent-node-990.local.vcn
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 260556)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 260556
=======================================================
vwxyzjn commented 3 months ago

> @vwxyzjn Sorry for the half-baked bug report. See it at https://github.com/huggingface/accelerate/issues/2868. I thought about using PPOv2Trainer, but it doesn't suit my case: I run inference twice with my policy and use both results for the assessment, i.e. res1 = policy(x), res2 = policy(res1, x), and then I compute the reward with another model from those two results. As far as I can see, PPOv2Trainer allows only a single generation pass.

To support your use case, how about making a copy of the PPOv2Trainer and applying your desired changes?

artkpv commented 3 months ago

@vwxyzjn, I'm thinking about this. But the limitations, as I see them, are as follows (please feel free to correct me):

PPO trainer v2:

  1. It needs to load a reference model to compute the KL divergence, which takes additional memory. I saw that v1 skips this when the model is a PEFT model.
  2. The way the policy generates continuations is coupled inside the train method, which makes it harder to do something like res1 = policy(X), res2 = policy(res1, X).
  3. I don't have a reward model locally; I use the APIs of other models (GPT and others) to get a reward (see the sketch after this list).
  4. Padding. I see v2 uses no padding but concatenates query and response, as in the original paper. But I don't want my reward model to see the policy's query.
  5. (And something else)
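For points 3 and 4, the reward side would look roughly like the sketch below with the v1 step API; `get_api_score` is a hypothetical helper, and the scorer only ever sees the decoded responses, never the policy's query:

```python
# Sketch for points 3 and 4: rewards come from an external API, and the scorer
# never sees the policy's query. `get_api_score` is a hypothetical helper.
import torch

def external_rewards(tokenizer, responses, get_api_score):
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in responses]
    return [torch.tensor(get_api_score(t)) for t in texts]

# usage inside the training loop (queries/responses are lists of 1-D tensors):
# stats = ppo_trainer.step(queries, responses, external_rewards(tokenizer, responses, get_api_score))
```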

Anyway, I'm thinking about how to work this out. Thanks.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.