vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Watchdog caught collective operation timeout for llama2-70b-chat on tp=8 #2734

Open flexwang opened 5 months ago

flexwang commented 5 months ago

See the full log below. The engine handles the first few requests and then gets stuck.
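For reference, a minimal sketch of how an engine with this configuration can be brought up through the offline API (the model path and prompt below are placeholders, not the exact ones used here; only tensor_parallel_size=8, float16 and the 4096 context length are taken from the config line in the log):

from vllm import LLM, SamplingParams

# Placeholder path standing in for the local llama2-70b-chat checkpoint from the log.
llm = LLM(
    model="/models/llama2-70b-chat",  # assumed local checkpoint directory
    tensor_parallel_size=8,           # tp=8 as in the report
    dtype="float16",
    max_model_len=4096,               # matches max_seq_len=4096 in the engine config
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# The hang reported here shows up after a few successful generate() calls.
outputs = llm.generate(["Hello, how are you?"], sampling_params)
for out in outputs:
    print(out.outputs[0].text)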

2024-02-03 09:54:56,181 INFO worker.py:1724 -- Started a local Ray instance.
INFO 02-03 09:54:57 llm_engine.py:70] Initializing an LLM engine with config: model='/models/llama2-70b-chat_tp=8_20240131200023421_3c3f075a-d95c-44f1-93e4-8dd63d09832c', tokenizer='/models/llama2-70b-chat_tp=8_20240131200023421_3c3f075a-d95c-44f1-93e4-8dd63d09832c', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=8, quantization=None, enforce_eager=False, seed=0)
[pod-name]:1:1 [0] NCCL INFO Bootstrap : Using eth0:10.4.22.79<0>
[pod-name]:1:1 [0] NCCL INFO cudaDriverVersion 12010
NCCL version 2.18.1+cuda12.1
(RayWorkerVllm pid=4773) [pod-name]:4773:4773 [1] NCCL INFO cudaDriverVersion 12010
(RayWorkerVllm pid=4773) [pod-name]:4773:4773 [1] NCCL INFO Bootstrap : Using eth0:10.4.22.79<0>
[pod-name]:1:5421 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
[pod-name]:1:5421 [0] NCCL INFO P2P plugin IBext
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO P2P plugin IBext
[pod-name]:1:5421 [0] NCCL INFO NET/IB : No device found.
[pod-name]:1:5421 [0] NCCL INFO NET/IB : No device found.
[pod-name]:1:5421 [0] NCCL INFO NET/Socket : Using [0]eth0:10.4.22.79<0>
[pod-name]:1:5421 [0] NCCL INFO Using network Socket
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO NET/IB : No device found.
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO NET/IB : No device found.
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO NET/Socket : Using [0]eth0:10.4.22.79<0>
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO Using network Socket
[pod-name]:1:5421 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
[pod-name]:1:5421 [0] NCCL INFO NVLS multicast support is not available on dev 0
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO NVLS multicast support is not available on dev 1
[pod-name]:1:5421 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:5421 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
[pod-name]:1:5421 [0] NCCL INFO P2P Chunksize set to 524288
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO P2P Chunksize set to 524288
(RayWorkerVllm pid=5085) [pod-name]:5085:5435 [4] NCCL INFO Channel 00/0 : 4[901c0] -> 5[901d0] via P2P/IPC/read
(RayWorkerVllm pid=5310) [pod-name]:5310:5310 [7] NCCL INFO cudaDriverVersion 12010 [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayWorkerVllm pid=5310) [pod-name]:5310:5310 [7] NCCL INFO Bootstrap : Using eth0:10.4.22.79<0> [repeated 6x across cluster]
[pod-name]:1:5421 [0] NCCL INFO Channel 00/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 01/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
(RayWorkerVllm pid=5310) [pod-name]:5310:5433 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so [repeated 6x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:5433 [7] NCCL INFO P2P plugin IBext [repeated 6x across cluster]
[pod-name]:1:5421 [0] NCCL INFO Channel 02/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 03/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 04/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
(RayWorkerVllm pid=5310) [pod-name]:5310:5433 [7] NCCL INFO NET/IB : No device found. [repeated 12x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:5433 [7] NCCL INFO NET/Socket : Using [0]eth0:10.4.22.79<0> [repeated 6x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:5433 [7] NCCL INFO Using network Socket [repeated 6x across cluster]
[pod-name]:1:5421 [0] NCCL INFO Channel 05/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 06/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 07/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 08/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 09/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 10/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 11/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 12/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 13/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 14/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 15/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 16/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 17/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 18/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 19/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 20/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 21/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 22/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:5421 [0] NCCL INFO Channel 23/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
(RayWorkerVllm pid=5085) [pod-name]:5085:5435 [4] NCCL INFO Connected all rings
[pod-name]:1:5421 [0] NCCL INFO Connected all rings
(RayWorkerVllm pid=5310) [pod-name]:5310:5433 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000 [repeated 6x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:5433 [7] NCCL INFO NVLS multicast support is not available on dev 7 [repeated 6x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:5433 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 [repeated 6x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:5433 [7] NCCL INFO P2P Chunksize set to 524288 [repeated 6x across cluster]
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO Channel 06/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read [repeated 229x across cluster]
[pod-name]:1:5421 [0] NCCL INFO Connected all trees
[pod-name]:1:5421 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
[pod-name]:1:5421 [0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO Connected all trees
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[pod-name]:1:5421 [0] NCCL INFO comm 0x558ecef9b620 rank 0 nranks 8 cudaDev 0 busId 101c0 commId 0x553d0a1a3119e239 - Init COMPLETE
(RayWorkerVllm pid=4773) [pod-name]:4773:5434 [1] NCCL INFO comm 0x5588a64707b0 rank 1 nranks 8 cudaDev 1 busId 101d0 commId 0x553d0a1a3119e239 - Init COMPLETE
[pod-name]:1:7510 [0] NCCL INFO Using network Socket
(RayWorkerVllm pid=4851) [pod-name]:4851:7512 [2] NCCL INFO Using network Socket
(RayWorkerVllm pid=5310) [pod-name]:5310:5433 [7] NCCL INFO Connected all rings [repeated 6x across cluster]
(RayWorkerVllm pid=5235) [pod-name]:5235:5432 [6] NCCL INFO Channel 23/0 : 6[a01c0] -> 5[901d0] via P2P/IPC/read [repeated 106x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:5433 [7] NCCL INFO Connected all trees [repeated 6x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:5433 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 [repeated 6x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:5433 [7] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer [repeated 6x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:5433 [7] NCCL INFO comm 0x5562ccc7ef40 rank 7 nranks 8 cudaDev 7 busId a01d0 commId 0x553d0a1a3119e239 - Init COMPLETE [repeated 6x across cluster]
[pod-name]:1:7510 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
[pod-name]:1:7510 [0] NCCL INFO NVLS multicast support is not available on dev 0
(RayWorkerVllm pid=5085) [pod-name]:5085:7514 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
(RayWorkerVllm pid=5085) [pod-name]:5085:7514 [4] NCCL INFO NVLS multicast support is not available on dev 4
[pod-name]:1:7510 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
[pod-name]:1:7510 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
[pod-name]:1:7510 [0] NCCL INFO P2P Chunksize set to 524288
(RayWorkerVllm pid=4773) [pod-name]:4773:7516 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
(RayWorkerVllm pid=4773) [pod-name]:4773:7516 [1] NCCL INFO P2P Chunksize set to 524288
[pod-name]:1:7510 [0] NCCL INFO Channel 00/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 01/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 02/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 03/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 04/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
(RayWorkerVllm pid=5310) [pod-name]:5310:7517 [7] NCCL INFO Using network Socket [repeated 6x across cluster]
(RayWorkerVllm pid=4773) [pod-name]:4773:7516 [1] NCCL INFO Channel 04/0 : 1[101d0] -> 2[201c0] via P2P/IPC/read [repeated 30x across cluster]
[pod-name]:1:7510 [0] NCCL INFO Channel 05/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 06/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 07/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 08/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 09/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 10/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 11/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 12/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 13/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 14/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 15/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 16/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 17/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 18/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 19/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 20/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 21/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 22/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Channel 23/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
[pod-name]:1:7510 [0] NCCL INFO Connected all rings
(RayWorkerVllm pid=4773) [pod-name]:4773:7516 [1] NCCL INFO Connected all rings
(RayWorkerVllm pid=5010) [pod-name]:5010:7515 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff [repeated 6x across cluster]
(RayWorkerVllm pid=5010) [pod-name]:5010:7515 [3] NCCL INFO NVLS multicast support is not available on dev 3 [repeated 6x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:7517 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 [repeated 6x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:7517 [7] NCCL INFO P2P Chunksize set to 524288 [repeated 6x across cluster]
(RayWorkerVllm pid=4773) [pod-name]:4773:7516 [1] NCCL INFO Channel 11/0 : 1[101d0] -> 0[101c0] via P2P/IPC/read [repeated 225x across cluster]
[pod-name]:1:7510 [0] NCCL INFO Connected all trees
[pod-name]:1:7510 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
[pod-name]:1:7510 [0] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
(RayWorkerVllm pid=4773) [pod-name]:4773:7516 [1] NCCL INFO Connected all trees
(RayWorkerVllm pid=4773) [pod-name]:4773:7516 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
(RayWorkerVllm pid=4773) [pod-name]:4773:7516 [1] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
[pod-name]:1:7510 [0] NCCL INFO comm 0x558f1b71df40 rank 0 nranks 8 cudaDev 0 busId 101c0 commId 0xee91d00f9a25e09b - Init COMPLETE
(RayWorkerVllm pid=4773) [pod-name]:4773:7516 [1] NCCL INFO comm 0x5588f4138b50 rank 1 nranks 8 cudaDev 1 busId 101d0 commId 0xee91d00f9a25e09b - Init COMPLETE
(RayWorkerVllm pid=5310) [pod-name]:5310:7517 [7] NCCL INFO Connected all rings [repeated 6x across cluster]
INFO 02-03 10:08:05 llm_engine.py:275] # GPU blocks: 22682, # CPU blocks: 6553
INFO 02-03 10:08:08 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 02-03 10:08:08 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
(RayWorkerVllm pid=4773) INFO 02-03 10:08:08 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=4773) INFO 02-03 10:08:08 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
(RayWorkerVllm pid=5310) [pod-name]:5310:7541 [7] NCCL INFO Channel 31/1 : 7[a01d0] -> 0[101c0] via P2P/IPC/read [repeated 305x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:7517 [7] NCCL INFO Connected all trees [repeated 6x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:7517 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 [repeated 6x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:7517 [7] NCCL INFO 24 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer [repeated 6x across cluster]
(RayWorkerVllm pid=5310) [pod-name]:5310:7517 [7] NCCL INFO comm 0x5562d7923cb0 rank 7 nranks 8 cudaDev 7 busId a01d0 commId 0xee91d00f9a25e09b - Init COMPLETE [repeated 6x across cluster]
[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
(RayWorkerVllm pid=4773) [W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
(RayWorkerVllm pid=5160) INFO 02-03 10:08:43 model_runner.py:547] Graph capturing finished in 35 secs.
(RayWorkerVllm pid=5310) INFO 02-03 10:08:08 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. [repeated 6x across cluster]
(RayWorkerVllm pid=5310) INFO 02-03 10:08:08 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. [repeated 6x across cluster]
INFO 02-03 10:08:43 model_runner.py:547] Graph capturing finished in 35 secs.
INFO:root:took 829.11 seconds to start vllm engine for model llama2-70b-chat
INFO 02-03 10:09:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:11:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:14:10 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:15:30 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:18:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:21:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:23:11 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:27:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:34:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:37:19 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:39:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:43:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:44:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:47:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:51:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:54:33 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 10:57:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 11:00:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 11:02:14 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 11:05:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 11:15:10 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 11:16:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 11:18:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 11:20:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 11:29:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 11:31:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 11:36:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 11:39:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 11:49:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 11:50:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 11:53:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 12:01:18 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 12:02:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 12:08:31 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 12:10:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 12:16:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 12:17:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 12:22:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 12:24:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 12:29:12 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 12:37:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 12:42:08 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 12:54:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 12:56:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 12:59:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 13:01:15 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 13:07:19 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 13:09:11 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 13:12:03 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 13:14:04 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 02-03 13:18:39 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
(RayWorkerVllm pid=5085) [E ProcessGroupNCCL.cpp:475] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20518, OpType=ALLREDUCE, NumelIn=106496, NumelOut=106496, Timeout(ms)=1800000) ran for 1800270 milliseconds before timing out.
(RayWorkerVllm pid=5310) [W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator()) [repeated 6x across cluster]
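Note on the last lines: Timeout(ms)=1800000 is the default 30-minute ProcessGroupNCCL watchdog, so rank 4 was genuinely stuck in an ALLREDUCE rather than just slow. The log itself points at two knobs worth trying on the next run: the model_runner warning suggests 'enforce_eager=True' / '--enforce-eager' to skip CUDA graph capture, and more verbose NCCL logging shows which collective the ranks hang in. A debugging sketch only, not a confirmed fix (the model path is a placeholder):

import os

# Standard NCCL debug settings; must be set before the engine (and NCCL) initializes.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"

from vllm import LLM

llm = LLM(
    model="/models/llama2-70b-chat",  # placeholder path
    tensor_parallel_size=8,
    enforce_eager=True,  # eager mode, as suggested by the model_runner warning above
)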
WangErXiao commented 4 months ago

Hi, have you fixed it?