magpie-align / magpie

Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!
https://magpie-align.github.io/

How can I generate data with multiple GPUs? #17

Closed · blackblue9 closed this 4 weeks ago

blackblue9 commented 1 month ago

Can instructions and responses be generated with multiple GPUs at the same time? For example, when using the llama3.1-8b-instruct model, setting device to "0,1,2,3,4,5,6,7" still seems to deploy the vLLM service on only one GPU to generate the corpus. Does the code support deploying multiple services across several GPUs at once to generate data? Thanks!

fly-dust commented 1 month ago

It should work. You also need to raise the tensor parallel size at the same time; see the scripts for the larger models in the scripts folder.
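
For reference, here is a minimal sketch of the underlying idea: keep the GPU list (device) and the tensor parallel size in sync when creating the vLLM engine. This is illustrative only and not the repo's exact code; the checkpoint path and prompt are placeholders.

```python
# Minimal sketch (not the repo's exact code): serving one model across 8 GPUs with
# vLLM tensor parallelism. The checkpoint path and prompt are placeholders.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"  # mirrors the scripts' `device` arg

from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/llama3.1-70b-instruct",  # placeholder path
    tensor_parallel_size=8,                  # mirrors the scripts' `tensor_parallel` arg
    gpu_memory_utilization=0.95,
    max_model_len=4096,
)

params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=1024)
outputs = llm.generate(["Hello"], params)    # placeholder prompt
print(outputs[0].outputs[0].text)
```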

blackblue9 commented 1 month ago

Why does this problem occur every time I use the model to generate instructions? I'm using the llama3.1-70b-instruct model with device="0,1,2,3,4,5,6,7", tensor_parallel=8, total_prompts=${2:-50000}, and everything else at the default values. The run always fails about halfway through. The error output is:

Processed prompts: 100%|██████████| 1/1 [00:37<00:00, 37.22s/it, est. speed input: 0.16 toks/s, output: 2180.11 toks/s]
 55%|█████▌    | 138/250 [1:24:34<1:08:41, 36.80s/it]
Processed prompts: 100%|██████████| 1/1 [00:37<00:00, 37.17s/it, est. speed input: 0.16 toks/s, output: 2133.45 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:35<00:00, 35.49s/it, est. speed input: 0.17 toks/s, output: 2054.89 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:36<00:00, 36.43s/it, est. speed input: 0.16 toks/s, output: 2076.58 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:36<00:00, 36.41s/it, est. speed input: 0.16 toks/s, output: 2124.56 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:34<00:00, 34.55s/it, est. speed input: 0.17 toks/s, output: 1938.38 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:35<00:00, 35.97s/it, est. speed input: 0.17 toks/s, output: 2082.83 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:37<00:00, 37.33s/it, est. speed input: 0.16 toks/s, output: 2197.61 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:36<00:00, 36.36s/it, est. speed input: 0.17 toks/s, output: 2121.81 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:36<00:00, 36.10s/it, est. speed input: 0.17 toks/s, output: 2093.63 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:35<00:00, 35.88s/it, est. speed input: 0.17 toks/s, output: 2095.68 toks/s]
 59%|█████▉    | 148/250 [1:30:36<1:01:27, 36.16s/it]
(VllmWorkerProcess pid=421176) WARNING 08-01 10:24:19 shm_broadcast.py:404] No available block found in 60 second.
(VllmWorkerProcess pid=421174) WARNING 08-01 10:24:19 shm_broadcast.py:404] No available block found in 60 second.
(VllmWorkerProcess pid=421171) WARNING 08-01 10:24:19 shm_broadcast.py:404] No available block found in 60 second.
(VllmWorkerProcess pid=421175) WARNING 08-01 10:24:19 shm_broadcast.py:404] No available block found in 60 second.
(VllmWorkerProcess pid=421172) WARNING 08-01 10:24:19 shm_broadcast.py:404] No available block found in 60 second.
(VllmWorkerProcess pid=421170) WARNING 08-01 10:24:19 shm_broadcast.py:404] No available block found in 60 second.
(VllmWorkerProcess pid=421173) WARNING 08-01 10:24:19 shm_broadcast.py:404] No available block found in 60 second.
[the same shm_broadcast warning repeats every minute for each of the seven VllmWorkerProcess PIDs (421170–421176) until 08-01 10:33:19]
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1523382, OpType=GATHER, NumelIn=368736, NumelOut=0, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 1] Timeout at NCCL work: 1523382, last enqueued NCCL work: 1523382, last completed NCCL work: 1523381.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1523382, OpType=GATHER, NumelIn=368736, NumelOut=0, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff4b3c6e897 in /usr/local/miniconda3/envs/magpie/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7ff4b4f47c62 in /usr/local/miniconda3/envs/magpie/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7ff4b4f4ca80 in /usr/local/miniconda3/envs/magpie/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7ff4b4f4ddcc in /usr/local/miniconda3/envs/magpie/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7ff500a02bf4 in /usr/local/miniconda3/envs/magpie/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7ff501cc5609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7ff501a90133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[identical ProcessGroupNCCL watchdog timeout errors and stack traces follow for ranks 7, 3, 5, 2, 4, and 6]

ERROR 08-01 10:33:20 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 421170 died, exit code: -6
INFO 08-01 10:33:20 multiproc_worker_utils.py:123] Killing local vLLM worker processes
fly-dust commented 1 month ago

This looks like an NCCL problem related to your machine's environment; you could search the vLLM repo for an answer. Can you run the 8B model on a single GPU?
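
As a general starting point for debugging NCCL hangs like the shm_broadcast warnings and watchdog timeout above (generic NCCL advice, not taken from this thread), it can help to turn on NCCL logging and disable peer-to-peer or InfiniBand transports before launching the engine. The settings below are standard NCCL environment variables, shown only as a sketch.

```python
# Generic NCCL debugging knobs (assumption: standard NCCL env vars, not settings from
# the Magpie scripts). They must be set before the distributed workers start.
import os

os.environ["NCCL_DEBUG"] = "INFO"       # verbose NCCL logs to pinpoint the stuck collective
os.environ["NCCL_P2P_DISABLE"] = "1"    # work around flaky GPU peer-to-peer links
os.environ["NCCL_IB_DISABLE"] = "1"     # skip InfiniBand on a single-node machine

# ...then create the vLLM engine as usual, e.g. LLM(model=..., tensor_parallel_size=8)
```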

blackblue9 commented 1 month ago

The 8B model runs fine on a single GPU, but problems appear once I increase both device and tensor_parallel. For example, with device="0,1" and tensor_parallel=2, instruction generation works, but response generation fails with the error below:

Stop token ids: [128009, 128001, 128006, 128007]
Processed prompts: 100%|██████████| 1/1 [00:19<00:00, 19.22s/it, est. speed input: 0.26 toks/s, output: 3881.63 toks/s]
 20%|██        | 1/5 [00:19<01:16, 19.25s/it]
Checkpoint saved. Total prompts: 200
Processed prompts: 100%|██████████| 1/1 [00:18<00:00, 18.32s/it, est. speed input: 0.27 toks/s, output: 3812.72 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:19<00:00, 19.05s/it, est. speed input: 0.26 toks/s, output: 3924.88 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:20<00:00, 20.08s/it, est. speed input: 0.25 toks/s, output: 4160.85 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:18<00:00, 18.11s/it, est. speed input: 0.28 toks/s, output: 3602.21 toks/s]
100%|██████████| 5/5 [01:34<00:00, 18.96s/it]
Instruction generated from /mnt/tenant-home_speed/AIM/model/llama3.1-8B-instruct. Total prompts: 1000
INFO 08-01 16:02:57 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x000000000209ce20)

Current thread 0x00007fe2a51554c0 (most recent call first):
  <no Python frame>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _brotli, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, zstandard.backend_c, PIL._imaging, zmq.backend.cython._zmq (total: 40)
/usr/local/miniconda3/envs/magpie/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
magpie-llama3.1-8b.sh: line 55: 627220 Aborted                 CUDA_VISIBLE_DEVICES=$device python ../exp/gen_ins.py --device $device --model_path $model_path --total_prompts $total_prompts --top_p $ins_topp --temp $ins_temp --tensor_parallel $tensor_parallel --gpu_memory_utilization $gpu_memory_utilization --disable_early_stopping --sanitize --logits_processor --n $n --job_name $job_name --timestamp $timestamp --max_tokens 1024
[magpie.sh] Finish Generating Instructions!
[magpie.sh] Start Generating Responses...
Response Generation Manager. Arguments: Namespace(model_path='/mnt/tenant-home_speed/AIM/model/llama3.1-8B-instruct', input_file='../data/llama3.1-8B-instruct_topp1_temp0.9_1722499246/Magpie_llama3.1-8B-instruct_1000_1722499246_ins.json', batch_size=200, checkpoint_every=20, api_url='https://api.together.xyz/v1/chat/completions', api_key=None, offline=True, engine='vllm', device='0,1', dtype='bfloat16', tensor_parallel_size=2, gpu_memory_utilization=0.95, max_tokens=4096, max_model_len=4096, temperature=0.6, top_p=0.9, repetition_penalty=1.0, tokenizer_template=False)
Start Local vllm engine...
INFO 08-01 16:03:06 config.py:715] Defaulting to use mp for distributed inference
INFO 08-01 16:03:06 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/mnt/tenant-home_speed/AIM/model/llama3.1-8B-instruct', speculative_config=None, tokenizer='/mnt/tenant-home_speed/AIM/model/llama3.1-8B-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/mnt/tenant-home_speed/AIM/model/llama3.1-8B-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-01 16:03:06 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=628644) INFO 08-01 16:03:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=628644) ERROR 08-01 16:03:07 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=628644) ERROR 08-01 16:03:07 multiproc_worker_utils.py:226]   File "/usr/local/miniconda3/envs/magpie/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=628644) ERROR 08-01 16:03:07 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=628644) ERROR 08-01 16:03:07 multiproc_worker_utils.py:226]   File "/usr/local/miniconda3/envs/magpie/lib/python3.10/site-packages/vllm/worker/worker.py", line 123, in init_device
(VllmWorkerProcess pid=628644) ERROR 08-01 16:03:07 multiproc_worker_utils.py:226]     torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=628644) ERROR 08-01 16:03:07 multiproc_worker_utils.py:226]   File "/usr/local/miniconda3/envs/magpie/lib/python3.10/site-packages/torch/cuda/__init__.py", line 399, in set_device
(VllmWorkerProcess pid=628644) ERROR 08-01 16:03:07 multiproc_worker_utils.py:226]     torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=628644) ERROR 08-01 16:03:07 multiproc_worker_utils.py:226]   File "/usr/local/miniconda3/envs/magpie/lib/python3.10/site-packages/torch/cuda/__init__.py", line 279, in _lazy_init
(VllmWorkerProcess pid=628644) ERROR 08-01 16:03:07 multiproc_worker_utils.py:226]     raise RuntimeError(
(VllmWorkerProcess pid=628644) ERROR 08-01 16:03:07 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=628644) ERROR 08-01 16:03:07 multiproc_worker_utils.py:226]

What version of flash_attn are you using? The requirements file already pins vllm to 0.5.3.post1, which is what I have, so it doesn't seem to be a vLLM version problem.
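
For the "Cannot re-initialize CUDA in forked subprocess" error in the response step, a commonly suggested workaround (general vLLM advice, not something confirmed in this thread) is to switch vLLM's worker start method from fork to spawn before the engine is created, roughly as sketched below.

```python
# Sketch of the usual workaround for "Cannot re-initialize CUDA in forked subprocess":
# ask vLLM to spawn (rather than fork) its tensor-parallel workers. Set this before
# anything in the parent process touches CUDA.
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM

llm = LLM(
    model="/path/to/llama3.1-8B-instruct",  # placeholder path
    tensor_parallel_size=2,
    gpu_memory_utilization=0.95,
    max_model_len=4096,
)
```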

fly-dust commented 1 month ago

Have you tried uninstalling flash attention?

blackblue9 commented 1 month ago

I uninstalled it and the problem is still there. Have you never run into this? It seems to happen frequently when tp is set to 4 or 8.

fly-dust commented 1 month ago

I haven't hit this on either a Slurm cluster or a local machine... It still feels like a vLLM issue to me. If nothing else works, you can use the transformers engine instead; it's just a bit slower.
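
For completeness, a Hugging Face transformers fallback for response generation might look roughly like the sketch below (illustrative only; the repo's actual engine option may differ). device_map="auto" shards the model across the visible GPUs without vLLM's NCCL-based tensor parallelism.

```python
# Illustrative transformers-based fallback for response generation; not the repo's
# engine implementation. The checkpoint path and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/llama3.1-8B-instruct"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a short poem about data synthesis."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```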

fly-dust commented 4 weeks ago

I will close this issue as completed since it has not been active~