msr-fiddle / dejavu

Apache License 2.0

How does the decoder instance receive the KV cache from the prefill instance? #4

Closed pipul closed 2 months ago

pipul commented 2 months ago

In the prefill phase, when each layer finishes, the copy_kv_cache_ubatch_layer() function in ParallelGptContextDecoder.cc calls stream_out to send that layer's KV cache to the decode instance.

But I can't find any code that receives the KV cache from the prefill instance. Where does that happen?
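(For context, the send path described above is roughly of the following shape. This is a simplified, hypothetical sketch, not the real copy_kv_cache_ubatch_layer() implementation; send_layer_kv_cache and stream_out_to_decoder are made-up names standing in for the actual function and transport call.)

```cpp
// Hypothetical sketch of the per-layer send on the prefill side.
#include <cuda_runtime.h>
#include <cstddef>

// Assumed placeholder for the real transport call (MPI_Put or a Boost write).
static void stream_out_to_decoder(const void* host_buf, size_t bytes, int layer) {
    (void)host_buf; (void)bytes; (void)layer;  // transport-specific in DejaVu
}

void send_layer_kv_cache(const void* device_k, const void* device_v,
                         char* host_staging, size_t k_bytes, size_t v_bytes,
                         int layer) {
    // Stage this layer's K and V caches from GPU to a CPU buffer...
    cudaMemcpy(host_staging,           device_k, k_bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(host_staging + k_bytes, device_v, v_bytes, cudaMemcpyDeviceToHost);
    // ...and stream them out to the decode instance as soon as the layer is done.
    stream_out_to_decoder(host_staging, k_bytes + v_bytes, layer);
}
```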

fotstrt commented 2 months ago

Hi, thank you for your comment!

Whenever the token generation worker needs a new prompt, it has to fetch that prompt's KV cache, which was computed by the prefill worker.

The way the receive happens in the token generation worker depends on whether we use MPI RDMA or BOOST send-recv for communication:

  1. In the case of MPI: the prefill worker uses MPI_Put to write the KV cache directly into the CPU memory of the token generation worker. The token generation worker then only has to copy the KV cache from its CPU to its GPU memory (see the MPI sketch after this list).
  2. In the case of BOOST:
    • The fetch itself is again a memcpy.
    • But since we now use Boost read/write for communication, we define a background thread that is responsible for receiving prompts, as shown here (see the Boost sketch after this list).
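Here is a minimal sketch of what the MPI RDMA path in point 1 amounts to. It is not the actual DejaVu code; the window setup, buffer names, and synchronization (MPI_Win_fence here) are assumptions made for illustration. The key point is that the prefill rank pushes the KV cache with MPI_Put, so the decode rank never posts an explicit receive; it only copies from its own CPU buffer to the GPU.

```cpp
// Minimal sketch of the MPI one-sided path (illustrative, not DejaVu's code).
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstddef>

// host_kv_buf: on the prefill rank, the staged KV data; on the decode rank,
// the CPU landing buffer that is exposed for RDMA.
void kv_transfer_mpi(void* host_kv_buf, size_t kv_bytes, void* device_kv_buf,
                     int prefill_rank, int decode_rank, int my_rank) {
    MPI_Win win;
    void*    base = (my_rank == decode_rank) ? host_kv_buf : nullptr;
    MPI_Aint size = (my_rank == decode_rank) ? (MPI_Aint)kv_bytes : 0;
    MPI_Win_create(base, size, /*disp_unit=*/1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (my_rank == prefill_rank) {
        // One-sided write: the KV cache lands in the decode worker's CPU memory
        // without that worker posting any receive call.
        MPI_Put(host_kv_buf, (int)kv_bytes, MPI_BYTE,
                decode_rank, /*target_disp=*/0, (int)kv_bytes, MPI_BYTE, win);
    }
    MPI_Win_fence(0, win);  // the Put is complete on both sides after this

    if (my_rank == decode_rank) {
        // The "fetch" on the decode side is just a host-to-device copy.
        cudaMemcpy(device_kv_buf, host_kv_buf, kv_bytes, cudaMemcpyHostToDevice);
    }
    MPI_Win_free(&win);
}
```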
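And a similarly minimal sketch of the BOOST path from point 2, assuming a Boost.Asio TCP socket (the actual DejaVu transport, port, and message framing may differ): a background thread on the token generation worker blocks on the socket, reads each incoming KV cache into a CPU buffer, and the decode loop then only needs a CPU-to-GPU copy.

```cpp
// Illustrative sketch of a background receiver thread using Boost.Asio.
#include <boost/asio.hpp>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

using boost::asio::ip::tcp;

std::atomic<bool> kv_ready{false};

void kv_receiver_thread(unsigned short port, std::vector<char>& host_kv_buf) {
    boost::asio::io_context io;
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), port));
    tcp::socket socket(io);
    acceptor.accept(socket);  // the prefill worker connects and starts writing

    // First read the payload size, then the KV cache bytes themselves.
    uint64_t kv_bytes = 0;
    boost::asio::read(socket, boost::asio::buffer(&kv_bytes, sizeof(kv_bytes)));
    host_kv_buf.resize(kv_bytes);
    boost::asio::read(socket, boost::asio::buffer(host_kv_buf.data(), kv_bytes));
    kv_ready = true;  // signal the decode loop that the prompt's KV cache arrived
}

int main() {
    std::vector<char> host_kv_buf;
    std::thread receiver(kv_receiver_thread, static_cast<unsigned short>(9000),
                         std::ref(host_kv_buf));

    // ... decode loop: once kv_ready is set, "fetching" the KV cache is just a
    // copy from host_kv_buf into GPU memory (e.g. via cudaMemcpy).

    receiver.join();
    return 0;
}
```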

I hope that answers your question!