Why is kvcache transfer slow using peer-to-peer communication (two GPUs on one machine).

Why use point-to-point communication (two GPU on a single machine)? Sending kvcache after each layer of model inference takes a long time, 2000 tokens take 0.5 seconds.

Could anyone provide some suggestions to help me optimize my NCCL code for transmitting KV cache to improve performance?

I put the kvcache tensor split information in a list（send_metas）, and send and receive it in a loop at a certain stage。

The process of sending：

>   for send_meta in send_metas:
>           sendgpu_id = send_meta['sendgpu_id']
>           recvgpu_id = send_meta[recvgpu_id]
>           layer_id = send_meta[layer_id]
>           block_id = send_meta[block_id]
>           tensor_tmp = send_meta[layer_id][kv, block_id, :]
>           if(rank==sendgpu_id):
>                  dist.send(tensor=tensor_tmp, dst=recvgpu_id)

The process of recv：


> for recv_meta in recv_metas:
>         sendgpu_id = recv_meta['sendgpu_id']
>         recvgpu_id = recv_meta[recvgpu_id]
>         layer_id = recv_meta[layer_id]
>         block_id = recv_meta[block_id]
>         tensor_tmp = send_meta[layer_id][kv, block_id, :]
>         if(reak==recvgpu_id):
>                 dist.recv(tensor=tensor_tmp, dst=sendgpu_id)

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch / pytorch

Why is kvcache transfer slow using peer-to-peer communication (two GPUs on one machine). #132747