Closed: pipul closed this issue 2 months ago
Hi,
Since all of our proposed solutions involve copies to/from CPU memory, we didn't end up using NCCLCacheManager for the actual transfers. We kept it in the codebase for two reasons:
1) we also wanted to evaluate the efficiency of streaming in NCCL p2p scenarios
2) it is useful for future implementations of direct GPU-GPU communication, since it provides the basic primitives to build upon
Please note that the main reason we did not use NCCL p2p for prompt-token disaggregation was that we wanted to pipeline the prompt transfer of some requests with the token generation of other requests. Using direct GPU-GPU copies would mean keeping the KV cache of more requests in GPU memory, which would add extra memory pressure.
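To make the pipelining argument concrete, here is a minimal sketch (not code from this repo) of why overlapping transfer with generation pays off. With CPU staging, the KV-cache transfer of request i+1 can run while request i is decoding; with serial direct GPU-GPU copies it cannot. All timings below are made-up illustrative units, not measurements.

```python
# Hypothetical timing model for prompt-token disaggregation.
# TRANSFER and DECODE are assumed illustrative costs per request.
TRANSFER = 3  # time to move one request's KV cache to the token node
DECODE = 5    # time to generate tokens for one request

def total_time(n_requests, pipelined):
    if not pipelined:
        # serial: transfer request i, then decode it, then move on
        return n_requests * (TRANSFER + DECODE)
    # pipelined: the transfer of request i+1 overlaps the decoding of
    # request i; since TRANSFER < DECODE here, every transfer after the
    # first is fully hidden behind decoding
    return TRANSFER + n_requests * DECODE

print(total_time(4, pipelined=False))  # 4 * (3 + 5) = 32
print(total_time(4, pipelined=True))   # 3 + 4 * 5 = 23
```

The trade-off the comment describes is that this overlap relies on staging through CPU memory: a direct GPU-GPU copy would force the source GPU to keep each request's KV cache resident until the receiver is ready, raising peak memory pressure.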
I grepped all the source code, and it seems that only ParallelGptDVBenchmark.cc uses NCCLCacheManager to transfer the KV cache. ParallelGptDVBenchmark.cc looks like test code, is that right?