kvcache-ai / Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
https://arxiv.org/abs/2407.00079
Apache License 2.0

prefill is ok, but decode gets stuck #8

Open artetaout opened 3 days ago

artetaout commented 3 days ago

the prefill already sends the KV (see the attached screenshot)

but the decoder gets stuck in drop_select (see the attached screenshot)

did i miss something?

alogfans commented 3 days ago

You can try to check the connectivity between the two nodes (the config files differ between TCP and RDMA mode). The "KV send DONE" message only means that the KVCache entry has been submitted, not that it has been received by the remote side.
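
For example, a quick reachability check along these lines (just a sketch; the 192.168.0.2 host and the 2379/12345 ports are placeholders, substitute the values from your mooncake.json):

    # quick TCP reachability check for the metadata server (etcd) and the
    # peer transfer engine port; the addresses below are placeholders
    import socket

    def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for name, (host, port) in {
        "metadata_server (etcd)": ("192.168.0.2", 2379),
        "peer transfer engine": ("192.168.0.2", 12345),
    }.items():
        print(f"{name}: {'reachable' if reachable(host, port) else 'UNREACHABLE'}")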

artetaout commented 1 day ago

thanks for the reply, but i run on a single node, so does this mean the prefiller works fine with etcd and something is wrong between the decoder and etcd? and how do i check that?

possibly it is etcd that does not run well? i started etcd with the command in the doc and tested it with the etcd client api successfully
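
for reference, the etcd check i ran was roughly like this (python-etcd3 client, host/port from my metadata_server setting):

    # minimal etcd round-trip check with the python-etcd3 client (pip install etcd3);
    # host/port are taken from the metadata_server entry in mooncake.json
    import etcd3

    client = etcd3.client(host="192.168.0.2", port=2379)
    client.put("mooncake_test_key", "hello")        # write a test key
    value, _meta = client.get("mooncake_test_key")  # read it back as bytes
    assert value == b"hello"
    client.delete("mooncake_test_key")
    print("etcd read/write ok")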

alogfans commented 1 day ago

The problem may be caused by an incorrect configuration (e.g., the mooncake.json file or environment variables), so you can try to recheck those first.

If that does not solve your problem, you can try to run our Transfer Engine Bench with --protocol=tcp.

artetaout commented 1 day ago

I run on a single node, can you help me correct it? The IP and port settings must be wrong somewhere; here's the command I used

artetaout commented 1 day ago

and the Transfer Engine Bench works well:

transfer_engine_bench --mode=initiator --metadata_server=192.168.0.2:2379 --local_server_name=192.168.0.2:12346 --segment_id=192.168.0.2:12345 --protocol=tcp
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1202 06:36:55.500823  3108 transfer_engine_bench.cpp:182] Worker 3 stopped!
I1202 06:36:55.500823  3106 transfer_engine_bench.cpp:182] Worker 1 stopped!
I1202 06:36:55.503688  3107 transfer_engine_bench.cpp:182] Worker 2 stopped!
I1202 06:36:55.503718  3105 transfer_engine_bench.cpp:182] Worker 0 stopped!
I1202 06:36:55.503857  3100 transfer_engine_bench.cpp:293] Test completed: duration 10.00, batch count 5700, throughput 0.30
pansicheng commented 1 day ago

I encountered a similar issue and found that this code (https://github.com/kvcache-ai/vllm/blob/9c319eee04652df9be39377378fb569a6762935e/vllm/distributed/kv_transfer/kv_pipe/mooncake_distributed_pipe.py#L86) prevented the prefill sender and the decode receiver from connecting properly on a single node. I modified it as follows:


    def _setup_sockets(self, rank_in_group: int, host: str, port: str) -> None:
        """Set up ZeroMQ sockets for sending and receiving data."""
        if rank_in_group == 0:
            # bind the data/ack sockets on port+1 and port+4,
            # and connect to the peer on port+2 and port+3
            self.sender_socket.bind(f"tcp://*:{int(port) + 1}")
            self.receiver_socket.connect(f"tcp://{host}:{int(port) + 2}")
            self.sender_ack.connect(f"tcp://{host}:{int(port) + 3}")
            self.receiver_ack.bind(f"tcp://*:{int(port) + 4}")
        else:
            # mirror of the rank-0 layout; when this rank's port is 5 higher
            # than rank 0's, both branches resolve to the same four ports
            self.receiver_socket.connect(f"tcp://{host}:{int(port) - 4}")
            self.sender_socket.bind(f"tcp://*:{int(port) - 3}")
            self.receiver_ack.bind(f"tcp://*:{int(port) - 2}")
            self.sender_ack.connect(f"tcp://{host}:{int(port) - 1}")

Additionally, I configured the mooncake.json as follows (prefill_port = decode_port + 5):

{
    "metadata_server": "127.0.0.1:2379",
    "prefill_url": "127.0.0.1:31287",
    "decode_url": "127.0.0.1:31282",
    "protocol": "rdma",
    "device_name": "erdma_1"
}

The above changes resolved the issue
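
For reference, this is why prefill_port = decode_port + 5 works: assuming rank 0 is handed the lower of the two ports and rank 1 the one that is 5 higher, the +1..+4 and -4..-1 offsets above land on the same four ZeroMQ ports (a quick arithmetic check, not part of the actual code):

    # check of the port offsets used in _setup_sockets above
    # assumption: rank 0 uses the lower port, rank 1 the port that is 5 higher
    decode_port = 31282
    prefill_port = decode_port + 5                  # 31287, as in mooncake.json

    rank0_ports = [decode_port + off for off in (1, 2, 3, 4)]
    rank1_ports = [prefill_port + off for off in (-4, -3, -2, -1)]

    assert rank0_ports == rank1_ports               # both sides meet on 31283..31286
    print(rank0_ports)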

ShangmingCai commented 23 hours ago

> (quoting @pansicheng's fix and configuration from the comment above)

Thanks for digging into this. This code was originally meant for the inter-node disaggregated prefill demo, so we did not consider port conflicts between multiple instances on the same node; this will be fixed in the future. FYI, if you want to run a disaggregated prefill demo on a single node, you can try https://github.com/vllm-project/vllm/pull/10502, which has already been merged into the main branch of vLLM.

liweiqing1997 commented 2 minutes ago

> (quoting @pansicheng's fix and @ShangmingCai's reply from the comments above)

Hello, could you help me check whether my configuration is correct? When I run the following configuration on a single machine, the decode process gets stuck at the "Initializing an LLM engine" step.

{
    "prefill_url": "127.0.0.1:31287",
    "decode_url": "127.0.0.1:31282",
    "metadata_server": "127.0.0.19:2379",
    "protocol": "tcp",
    "device_name": ""
}

CUDA_VISIBLE_DEVICES=4 VLLM_HOST_IP="127.0.0.1" VLLM_PORT="22301" MASTER_ADDR="127.0.0.1" MASTER_PORT="54324" MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_DISTRIBUTED_KV_ROLE=consumer VLLM_USE_MODELSCOPE=True nohup python3 -m vllm.entrypoints.openai.api_server --model /mnt/data_disk101/data_disk/Qwen1.5-14B-Chat --port 8200 --max-model-len 32000 --gpu-memory-utilization 0.95 > log/decode.log 2>&1 &

CUDA_VISIBLE_DEVICES=3 VLLM_HOST_IP="127.0.0.1" VLLM_PORT="22300" MASTER_ADDR="127.0.0.1" MASTER_PORT="54324" MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_DISTRIBUTED_KV_ROLE=producer VLLM_USE_MODELSCOPE=True nohup python3 -m vllm.entrypoints.openai.api_server --model /mnt/data_disk101/data_disk/Qwen1.5-14B-Chat --port 8100 --max-model-len 16000 --gpu-memory-utilization 0.95 > log/producer.log 2>&1 &

Could you please tell me whether the VLLM_PORT settings for prefill and decode need to be different on a single machine?