artetaout opened this issue 3 days ago
You can try to check the connectivity between the two nodes (with different config files in TCP & RDMA mode). The "KV send DONE" message only implies that the KVCache entry has been submitted, not that it has been delivered to the remote side.
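For reference, such a connectivity check can be done with the Transfer Engine Bench. A rough sketch only: the initiator flags mirror the command shown later in this thread, --mode=target is taken from the Mooncake documentation, and the IPs and ports are placeholders.
# on the target side
transfer_engine_bench --mode=target --metadata_server=<etcd_ip>:2379 --local_server_name=<target_ip>:12345 --protocol=tcp
# on the initiator side, pointing segment_id at the target
transfer_engine_bench --mode=initiator --metadata_server=<etcd_ip>:2379 --local_server_name=<initiator_ip>:12346 --segment_id=<target_ip>:12345 --protocol=tcp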
Thanks for the reply, but I run on a single node, so does this mean the prefiller works well with etcd and something is wrong between the decoder and etcd? And how can I check them?
Or could it be that etcd itself is not running well? I started etcd with the command from the doc and tested it successfully with the etcd client API.
The problem may be caused by incorrect confs (e.g., the mooncake.json file, env variables). You can try to recheck:
- In the mooncake.json file, prefill_url and decode_url should match the env variables VLLM_HOST_IP="192.168.0.137" VLLM_PORT="51000" on both the prefill and decode side. Use different ports if you run them on one machine (see the sketch below).
- protocol should be set to tcp.
- The mooncake.json file should usually be exact (no need to swap the prefill_url and decode_url fields).

If the above steps cannot solve your problem, you can try to run our Transfer Engine Bench with --protocol=tcp.
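For illustration only, a single-machine layout along these lines might look like the following; the IP and ports are placeholders, and the only points shown are that the hosts in mooncake.json match VLLM_HOST_IP and that the two vLLM instances use different ports:
{
    "prefill_url": "192.168.0.137:13003",
    "decode_url": "192.168.0.137:14003",
    "metadata_server": "192.168.0.137:2379",
    "protocol": "tcp",
    "device_name": ""
}
Prefill side: VLLM_HOST_IP="192.168.0.137" VLLM_PORT="51000" VLLM_DISTRIBUTED_KV_ROLE=producer ... --port 8100
Decode side: VLLM_HOST_IP="192.168.0.137" VLLM_PORT="51100" VLLM_DISTRIBUTED_KV_ROLE=consumer ... --port 8200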
I run on one single node; can you help me correct it? The IP and port settings must be wrong somewhere. Here are the commands I used:
the etcd server
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379
the mooncake.json
{
"prefill_url": "192.168.0.2:13002",
"decode_url": "192.168.0.2:14002",
"metadata_server": "192.168.0.2:2379",
"protocol": "tcp",
"device_name": ""
}
the prefill command
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=0 HF_ENDPOINT=https://hf-mirror.com VLLM_HOST_IP="192.168.0.2" VLLM_PORT="51000" MASTER_ADDR="192.168.0.2" MASTER_PORT="54324" MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_DISTRIBUTED_KV_ROLE=producer python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8100 --max-model-len 10000 --gpu-memory-utilization 0.95
the decode command
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=1 HF_ENDPOINT=https://hf-mirror.com VLLM_HOST_IP="192.168.0.2" VLLM_PORT="51000" MASTER_ADDR="192.168.0.2" MASTER_PORT="54324" MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_DISTRIBUTED_KV_ROLE=consumer python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8200 --max-model-len 10000 --gpu-memory-utilization 0.95
And the Transfer Engine Bench works well:
transfer_engine_bench --mode=initiator --metadata_server=192.168.0.2:2379 --local_server_name=192.168.0.2:12346 --segment_id=192.168.0.2:12345 --protocol=tcp
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1202 06:36:55.500823 3108 transfer_engine_bench.cpp:182] Worker 3 stopped!
I1202 06:36:55.500823 3106 transfer_engine_bench.cpp:182] Worker 1 stopped!
I1202 06:36:55.503688 3107 transfer_engine_bench.cpp:182] Worker 2 stopped!
I1202 06:36:55.503718 3105 transfer_engine_bench.cpp:182] Worker 0 stopped!
I1202 06:36:55.503857 3100 transfer_engine_bench.cpp:293] Test completed: duration 10.00, batch count 5700, throughput 0.30
I encountered a similar issue and found that this code (https://github.com/kvcache-ai/vllm/blob/9c319eee04652df9be39377378fb569a6762935e/vllm/distributed/kv_transfer/kv_pipe/mooncake_distributed_pipe.py#L86) was causing the sender and receiver of the prefill and decode instances to fail to connect properly on a single node. I modified it as follows:
def _setup_sockets(self, rank_in_group: int, host: str, port: str) -> None:
    """Set up ZeroMQ sockets for sending and receiving data."""
    if rank_in_group == 0:
        self.sender_socket.bind(f"tcp://*:{int(port) + 1}")
        self.receiver_socket.connect(f"tcp://{host}:{int(port) + 2}")
        self.sender_ack.connect(f"tcp://{host}:{int(port) + 3}")
        self.receiver_ack.bind(f"tcp://*:{int(port) + 4}")
    else:
        self.receiver_socket.connect(f"tcp://{host}:{int(port) - 4}")
        self.sender_socket.bind(f"tcp://*:{int(port) - 3}")
        self.receiver_ack.bind(f"tcp://*:{int(port) - 2}")
        self.sender_ack.connect(f"tcp://{host}:{int(port) - 1}")
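To see why the +5 offset works, here is the port arithmetic implied by the snippet above (which side is constructed with which port depends on how the pipe is instantiated, so treat the rank labels as illustrative): if rank 0 is given port P and rank 1 is given port P + 5, the four endpoints pair up, one bind to one connect, on ports P+1 through P+4.
rank 0 (port P):     bind P+1 (sender),     connect P+2 (receiver), connect P+3 (sender_ack), bind P+4 (receiver_ack)
rank 1 (port P + 5): connect P+1 (receiver), bind P+2 (sender),     bind P+3 (receiver_ack),  connect P+4 (sender_ack)
With the 31287/31282 pair below, that is ports 31283-31286, so the producer and consumer no longer collide when they run on the same host.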
Additionally, I configured the mooncake.json as follows (prefill_port = decode_port + 5):
{
"metadata_server": "127.0.0.1:2379",
"prefill_url": "127.0.0.1:31287",
"decode_url": "127.0.0.1:31282",
"protocol": "rdma",
"device_name": "erdma_1"
}
The above changes resolved the issue
Thanks for digging into this. This code was originally written for the default inter-node disaggregated prefill demo, so we did not consider the port-occupation problem of multiple instances on the same node; that will be solved in the future. FYI, if you want to run a disaggregated prefill demo on the same node, you can try https://github.com/vllm-project/vllm/pull/10502, which has already been merged into the main branch of vLLM.
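For anyone looking for that same-node setup: the PR moves the configuration from environment variables to a --kv-transfer-config argument. A rough sketch based on the examples around that PR; the connector name, JSON fields, and values are assumptions and should be verified against the current vLLM docs:
# prefill / KV producer (sketch, flags to be verified)
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8100 --kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'
# decode / KV consumer
CUDA_VISIBLE_DEVICES=1 python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8200 --kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}'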
Hello, could you help me check the correctness of the configuration? When I run the following configuration on a single machine, the decode process gets blocked at this step: Initializing an LLM engine.
{
"prefill_url": "127.0.0.1:31287",
"decode_url": "127.0.0.1:31282",
"metadata_server": "127.0.0.19:2379",
"protocol": "tcp",
"device_name": ""
}
CUDA_VISIBLE_DEVICES=4 VLLM_HOST_IP="127.0.0.1" VLLM_PORT="22301" MASTER_ADDR="127.0.0.1" MASTER_PORT="54324" MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_DISTRIBUTED_KV_ROLE=consumer VLLM_USE_MODELSCOPE=True nohup python3 -m vllm.entrypoints.openai.api_server --model /mnt/data_disk101/data_disk/Qwen1.5-14B-Chat --port 8200 --max-model-len 32000 --gpu-memory-utilization 0.95 > log/decode.log 2>&1 &
CUDA_VISIBLE_DEVICES=3 VLLM_HOST_IP="127.0.0.1" VLLM_PORT="22300" MASTER_ADDR="127.0.0.1" MASTER_PORT="54324" MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_DISTRIBUTED_KV_ROLE=producer VLLM_USE_MODELSCOPE=True nohup python3 -m vllm.entrypoints.openai.api_server --model /mnt/data_disk101/data_disk/Qwen1.5-14B-Chat --port 8100 --max-model-len 16000 --gpu-memory-utilization 0.95 > log/producer.log 2>&1 &
Could you please tell me whether the VLLM_PORT settings for prefill and decode on a single machine need to be different?
The prefill side already sends the KV, but the decoder gets stuck in drop_select. Did I miss something?