Why is point-to-point communication (two GPUs on a single machine) so slow? Sending the KV cache after each layer of model inference takes a long time: 2000 tokens take about 0.5 seconds.
Could anyone provide some suggestions to help me optimize my NCCL code for transmitting KV cache to improve performance?
I put the KV cache tensor split information in a list (`send_metas`), and send and receive it in a loop at a certain stage.
The process of sending:
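The original send snippet is not shown here, so below is a minimal sketch of what the described loop over `send_metas` could look like. It assumes (hypothetically) that each entry is an `(offset, length)` pair into the flattened KV cache, and it uses `torch.distributed`'s batched P2P API (`dist.P2POp` + `dist.batch_isend_irecv`) so the per-chunk transfers can overlap instead of serializing one blocking `dist.send` per chunk:

```python
import torch
import torch.distributed as dist

def chunks_from_metas(flat, send_metas):
    """Slice views out of a flattened 1-D tensor.

    send_metas is assumed to be a list of (offset, length) pairs
    (hypothetical layout; adapt to your actual metadata format).
    """
    return [flat.narrow(0, offset, length) for offset, length in send_metas]

def send_kv_chunks(kv_cache, send_metas, dst, group=None):
    """Send all KV-cache chunks to rank `dst` in one batched P2P launch.

    Requires an initialized process group (e.g. NCCL backend on GPUs).
    """
    flat = kv_cache.contiguous().view(-1)
    ops = [dist.P2POp(dist.isend, chunk, dst, group)
           for chunk in chunks_from_metas(flat, send_metas)]
    # One batched launch lets NCCL schedule the transfers together
    # instead of issuing a blocking send per chunk in a Python loop.
    for req in dist.batch_isend_irecv(ops):
        req.wait()
```

A further option worth measuring is concatenating all per-layer chunks into a single contiguous buffer and issuing one large send, since many small P2P calls pay a per-call launch overhead.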
The process of receiving:
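The receiving side is not shown either; a matching sketch, under the same assumed `(offset, length)` metadata layout, would pre-allocate one buffer per chunk and post all receives in the same batched call. The op order must mirror the sender's so the chunks pair up:

```python
import torch
import torch.distributed as dist

def alloc_recv_buffers(recv_metas, device, dtype):
    """Pre-allocate one flat buffer per (offset, length) meta entry."""
    return [torch.empty(length, device=device, dtype=dtype)
            for _, length in recv_metas]

def recv_kv_chunks(recv_metas, src, device, dtype, group=None):
    """Receive all KV-cache chunks from rank `src` in one batched launch.

    Requires an initialized process group; recv_metas must list the
    chunks in the same order the sender posted them.
    """
    bufs = alloc_recv_buffers(recv_metas, device, dtype)
    ops = [dist.P2POp(dist.irecv, buf, src, group) for buf in bufs]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return bufs
```

Pre-allocating the receive buffers once and reusing them across layers also avoids repeated `torch.empty` allocations on the hot path.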
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o