msr-fiddle / dejavu

Apache License 2.0

How does the decoder instance receive the KV cache from the prefill instance? #4

Closed pipul closed 2 months ago

pipul commented 2 months ago

In the prefill phase, when each layer finishes, the copy_kv_cache_ubatch_layer() function in ParallelGptContextDecoder.cc calls stream_out to send that layer's KV cache to the decode instance.

But I can't find any code that receives the KV cache from the prefill instance. Where does that happen?
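(For context, the send path described above is roughly of the following shape. This is a simplified, hypothetical sketch, not the real copy_kv_cache_ubatch_layer() implementation; send_layer_kv_cache and stream_out_to_decoder are made-up names standing in for the actual function and transport call.)

```cpp
// Hypothetical sketch of the per-layer send on the prefill side.
#include <cuda_runtime.h>
#include <cstddef>

// Assumed placeholder for the real transport call (MPI_Put or a Boost write).
static void stream_out_to_decoder(const void* host_buf, size_t bytes, int layer) {
    (void)host_buf; (void)bytes; (void)layer;  // transport-specific in DejaVu
}

void send_layer_kv_cache(const void* device_k, const void* device_v,
                         char* host_staging, size_t k_bytes, size_t v_bytes,
                         int layer) {
    // Stage this layer's K and V caches from GPU to a CPU buffer...
    cudaMemcpy(host_staging,           device_k, k_bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(host_staging + k_bytes, device_v, v_bytes, cudaMemcpyDeviceToHost);
    // ...and stream them out to the decode instance as soon as the layer is done.
    stream_out_to_decoder(host_staging, k_bytes + v_bytes, layer);
}
```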

fotstrt commented 2 months ago

Hi, thank you for your comment!

Whenever the token generation worker needs a new prompt, it has to fetch that prompt's KV cache, which was computed by the prefill worker.

The way the receive happens in the token generation worker depends on whether we use MPI RDMA or BOOST send-recv for communication:

  1. In the case of MPI: the prefill worker uses MPI_Put to write the KV cache directly into the CPU memory of the token generation worker. The token generation worker then only has to copy the KV cache from its CPU to its GPU memory (see the MPI sketch after this list).
  2. In the case of BOOST:
    • The fetch itself is again a memcpy.
    • But since we now use Boost read/write for communication, we define a background thread that is responsible for receiving prompts, as shown here (see the Boost sketch after this list).
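Here is a minimal sketch of what the MPI RDMA path in point 1 amounts to. It is not the actual DejaVu code; the window setup, buffer names, and synchronization (MPI_Win_fence here) are assumptions made for illustration. The key point is that the prefill rank pushes the KV cache with MPI_Put, so the decode rank never posts an explicit receive; it only copies from its own CPU buffer to the GPU.

```cpp
// Minimal sketch of the MPI one-sided path (illustrative, not DejaVu's code).
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstddef>

// host_kv_buf: on the prefill rank, the staged KV data; on the decode rank,
// the CPU landing buffer that is exposed for RDMA.
void kv_transfer_mpi(void* host_kv_buf, size_t kv_bytes, void* device_kv_buf,
                     int prefill_rank, int decode_rank, int my_rank) {
    MPI_Win win;
    void*    base = (my_rank == decode_rank) ? host_kv_buf : nullptr;
    MPI_Aint size = (my_rank == decode_rank) ? (MPI_Aint)kv_bytes : 0;
    MPI_Win_create(base, size, /*disp_unit=*/1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (my_rank == prefill_rank) {
        // One-sided write: the KV cache lands in the decode worker's CPU memory
        // without that worker posting any receive call.
        MPI_Put(host_kv_buf, (int)kv_bytes, MPI_BYTE,
                decode_rank, /*target_disp=*/0, (int)kv_bytes, MPI_BYTE, win);
    }
    MPI_Win_fence(0, win);  // the Put is complete on both sides after this

    if (my_rank == decode_rank) {
        // The "fetch" on the decode side is just a host-to-device copy.
        cudaMemcpy(device_kv_buf, host_kv_buf, kv_bytes, cudaMemcpyHostToDevice);
    }
    MPI_Win_free(&win);
}
```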
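And a similarly minimal sketch of the BOOST path from point 2, assuming a Boost.Asio TCP socket (the actual DejaVu transport, port, and message framing may differ): a background thread on the token generation worker blocks on the socket, reads each incoming KV cache into a CPU buffer, and the decode loop then only needs a CPU-to-GPU copy.

```cpp
// Illustrative sketch of a background receiver thread using Boost.Asio.
#include <boost/asio.hpp>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

using boost::asio::ip::tcp;

std::atomic<bool> kv_ready{false};

void kv_receiver_thread(unsigned short port, std::vector<char>& host_kv_buf) {
    boost::asio::io_context io;
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), port));
    tcp::socket socket(io);
    acceptor.accept(socket);  // the prefill worker connects and starts writing

    // First read the payload size, then the KV cache bytes themselves.
    uint64_t kv_bytes = 0;
    boost::asio::read(socket, boost::asio::buffer(&kv_bytes, sizeof(kv_bytes)));
    host_kv_buf.resize(kv_bytes);
    boost::asio::read(socket, boost::asio::buffer(host_kv_buf.data(), kv_bytes));
    kv_ready = true;  // signal the decode loop that the prompt's KV cache arrived
}

int main() {
    std::vector<char> host_kv_buf;
    std::thread receiver(kv_receiver_thread, static_cast<unsigned short>(9000),
                         std::ref(host_kv_buf));

    // ... decode loop: once kv_ready is set, "fetching" the KV cache is just a
    // copy from host_kv_buf into GPU memory (e.g. via cudaMemcpy).

    receiver.join();
    return 0;
}
```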

I hope that answers your question!