pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Why is KV-cache transfer slow using peer-to-peer communication (two GPUs on one machine)? #132747

Open liweiqing1997 opened 3 months ago

liweiqing1997 commented 3 months ago

Why is point-to-point communication slow between two GPUs on a single machine? Sending the KV cache after each layer of model inference takes a long time: the transfer for 2000 tokens takes 0.5 seconds.

Could anyone provide some suggestions to help me optimize my NCCL code for transmitting the KV cache, to improve performance?

I put the KV-cache tensor split information in a list (send_metas), then send and receive the blocks in a loop at a certain stage.
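For context, each entry in send_metas presumably looks something like this (the field names are taken from the loops below; the exact layout is an assumption):

```python
# Hypothetical shape of one send_metas entry, inferred from the loops below
send_meta = {
    'sendgpu_id': 0,   # rank that owns the block and sends it
    'recvgpu_id': 1,   # rank that receives the block
    'layer_id': 12,    # which transformer layer's KV cache
    'block_id': 7,     # which cache block within that layer
}
```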

The process of sending:

```python
# `kv_caches` is assumed to be the per-layer cache tensor list and `kv` the
# key/value index; the original snippet indexed `send_meta` here, which
# looks like a typo.
for send_meta in send_metas:
    sendgpu_id = send_meta['sendgpu_id']
    recvgpu_id = send_meta['recvgpu_id']
    layer_id = send_meta['layer_id']
    block_id = send_meta['block_id']
    tensor_tmp = kv_caches[layer_id][kv, block_id, :]
    if rank == sendgpu_id:
        dist.send(tensor=tensor_tmp, dst=recvgpu_id)
```

The process of receiving:

```python
for recv_meta in recv_metas:
    sendgpu_id = recv_meta['sendgpu_id']
    recvgpu_id = recv_meta['recvgpu_id']
    layer_id = recv_meta['layer_id']
    block_id = recv_meta['block_id']
    tensor_tmp = kv_caches[layer_id][kv, block_id, :]
    if rank == recvgpu_id:  # the original had `reak`, a typo for `rank`
        dist.recv(tensor=tensor_tmp, src=sendgpu_id)  # dist.recv takes `src`, not `dst`
```
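A likely reason this is slow is that every block incurs a separate blocking `dist.send`/`dist.recv` call, so the per-call launch latency is paid once per block per layer. A minimal sketch of issuing all per-block transfers as one batch of non-blocking P2P ops via `dist.batch_isend_irecv` (assuming the same `kv_caches`/`send_metas`/`recv_metas` layout as in the snippets above):

```python
import torch.distributed as dist

# Build one P2POp per block instead of calling blocking send/recv in a loop.
# Assumes kv_caches, send_metas, recv_metas, kv, and rank as in the snippets above.
p2p_ops = []
for meta in send_metas:
    if rank == meta['sendgpu_id']:
        t = kv_caches[meta['layer_id']][kv, meta['block_id'], :]
        p2p_ops.append(dist.P2POp(dist.isend, t, meta['recvgpu_id']))
for meta in recv_metas:
    if rank == meta['recvgpu_id']:
        t = kv_caches[meta['layer_id']][kv, meta['block_id'], :]
        p2p_ops.append(dist.P2POp(dist.irecv, t, meta['sendgpu_id']))

# Launch everything at once, then wait; NCCL can issue the transfers together
# instead of serializing one blocking call per block.
if p2p_ops:
    reqs = dist.batch_isend_irecv(p2p_ops)
    for req in reqs:
        req.wait()
```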

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

fegin commented 3 months ago

Maybe you can bucket the send and recv if that is doable.
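For illustration, a minimal sketch of bucketing along those lines: pack every block going to the same peer into one contiguous buffer, issue a single send per peer, and unpack on the receiving side. Function and variable names here are illustrative, not from the original code; both ranks must iterate the metadata in the same order so sizes line up.

```python
import torch
import torch.distributed as dist

def send_bucketed(kv_caches, send_metas, kv, rank, peer):
    # Gather every block this rank sends to `peer` into one flat buffer.
    blocks = [kv_caches[m['layer_id']][kv, m['block_id'], :]
              for m in send_metas
              if m['sendgpu_id'] == rank and m['recvgpu_id'] == peer]
    if blocks:
        bucket = torch.cat([b.flatten() for b in blocks])
        dist.send(tensor=bucket, dst=peer)  # one large send instead of many small ones

def recv_bucketed(kv_caches, recv_metas, kv, rank, peer):
    metas = [m for m in recv_metas
             if m['recvgpu_id'] == rank and m['sendgpu_id'] == peer]
    if not metas:
        return
    # The receiver reconstructs the same ordering and sizes to unpack the buffer.
    sizes = [kv_caches[m['layer_id']][kv, m['block_id'], :].numel() for m in metas]
    ref = kv_caches[metas[0]['layer_id']]
    bucket = torch.empty(sum(sizes), dtype=ref.dtype, device=ref.device)
    dist.recv(tensor=bucket, src=peer)
    for m, chunk in zip(metas, torch.split(bucket, sizes)):
        dst = kv_caches[m['layer_id']][kv, m['block_id'], :]
        dst.copy_(chunk.view_as(dst))
```

The extra device-side copies into and out of the bucket are usually much cheaper than paying NCCL's per-message overhead for thousands of small sends, which is the trade-off bucketing makes.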