msr-fiddle / dejavu

Apache License 2.0

Is NCCLCacheManager only used in test scenarios??? #5

Closed pipul closed 2 months ago

pipul commented 2 months ago

I grepped all of the source code, and it seems that only ParallelGptDVBenchmark.cc uses NCCLCacheManager to transfer the KV cache. ParallelGptDVBenchmark.cc looks like test code. Is NCCLCacheManager used anywhere else?

fotstrt commented 2 months ago

Hi,

Since all of our proposed solutions involve copies to/from CPU memory, we didn't end up using NCCLCacheManager for the actual transfers. We defined it for two reasons: 1) we wanted to evaluate the efficiency of streaming in NCCL p2p scenarios as well, and 2) it is useful for future implementations of direct GPU-GPU communication, since it provides the basic primitives to build upon.

Please note that the main reason we did not use NCCL p2p for prompt-token disaggregation is that we wanted to pipeline the prompt KV cache transfers of some requests with the token generation of other requests. With direct GPU-GPU copies, the KV cache of the requests being transferred would have to stay in GPU memory alongside the KV cache of the requests being generated, which would add extra memory pressure. Staging the transfers through CPU memory avoids that.
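To make the memory-pressure argument concrete, here is a rough back-of-the-envelope sketch. It is not taken from the dejavu code base: `ModelShape`, `kv_cache_bytes`, and `gpu_kv_footprint` are hypothetical names, and the sizing formula assumes standard multi-head attention, with one key and one value vector of `hidden_size` elements per token per layer.

```cpp
#include <cstdint>

// Hypothetical model description; these fields are illustrative assumptions,
// not types from dejavu or FasterTransformer.
struct ModelShape {
    std::uint64_t num_layers;
    std::uint64_t hidden_size;     // num_heads * head_dim
    std::uint64_t bytes_per_elem;  // e.g. 2 for fp16
};

// KV cache bytes for one request of `seq_len` tokens:
// 2 tensors (K and V) * layers * hidden_size * tokens * element size.
constexpr std::uint64_t kv_cache_bytes(const ModelShape& m, std::uint64_t seq_len) {
    return 2 * m.num_layers * m.hidden_size * seq_len * m.bytes_per_elem;
}

// GPU memory occupied by KV caches when `resident_requests` requests
// keep their cache on the GPU at the same time.
constexpr std::uint64_t gpu_kv_footprint(const ModelShape& m,
                                         std::uint64_t seq_len,
                                         std::uint64_t resident_requests) {
    return kv_cache_bytes(m, seq_len) * resident_requests;
}
```

As an illustration with made-up numbers: for a 32-layer model with hidden size 4096 in fp16 and 1024-token sequences, `kv_cache_bytes` gives 512 MiB per request. If 8 requests are generating tokens while 4 more are mid-transfer, keeping all 12 caches on the GPU costs 6 GiB, whereas staging the 4 in-flight caches in CPU memory leaves 4 GiB on the GPU.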