Closed: pipul closed this issue 2 months ago
Hi,
Since all of our proposed solutions involve copies to/from CPU memory, we didn't end up using NCCLCacheManager for the actual transfers. We kept it in the codebase for two reasons:
1) we also wanted to evaluate the efficiency of streaming in NCCL p2p scenarios
2) it is useful for future implementations of direct GPU-GPU communication, since it provides the basic primitives to build upon
Please note that the main reason we did not use NCCL p2p for prompt-token disaggregation was that we wanted to pipeline the prompt transfer of some requests with the token generation of other requests. Using direct GPU-GPU copies would mean keeping the KV cache of more requests in GPU memory, which would add extra memory pressure.
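To make the pipelining argument concrete, here is a minimal sketch (not code from this repo) of why overlapping transfer with generation pays off. With CPU staging, the KV-cache transfer of request i+1 can run while request i is decoding; with serial direct GPU-GPU copies it cannot. All timings below are made-up illustrative units, not measurements.

```python
# Hypothetical timing model for prompt-token disaggregation.
# TRANSFER and DECODE are assumed illustrative costs per request.
TRANSFER = 3  # time to move one request's KV cache to the token node
DECODE = 5    # time to generate tokens for one request

def total_time(n_requests, pipelined):
    if not pipelined:
        # serial: transfer request i, then decode it, then move on
        return n_requests * (TRANSFER + DECODE)
    # pipelined: the transfer of request i+1 overlaps the decoding of
    # request i; since TRANSFER < DECODE here, every transfer after the
    # first is fully hidden behind decoding
    return TRANSFER + n_requests * DECODE

print(total_time(4, pipelined=False))  # 4 * (3 + 5) = 32
print(total_time(4, pipelined=True))   # 3 + 4 * 5 = 23
```

The trade-off the comment describes is that this overlap relies on staging through CPU memory: a direct GPU-GPU copy would force the source GPU to keep each request's KV cache resident until the receiver is ready, raising peak memory pressure.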
I grepped all the source code, and it seems that only ParallelGptDVBenchmark.cc uses NCCLCacheManager to transfer the KV cache. ParallelGptDVBenchmark.cc looks like test code, is that right?