Closed jasperzhong closed 10 months ago
https://www.nvidia.cn/on-demand/session/gtccn2020-cns20626/
https://github.com/NVIDIA-Merlin/HugeCTR/blob/main/gpu_cache/ReadMe.md
Training uses a static cache, which is really just a split of the embedding table. Inference uses a dynamic cache (LRU).
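To make the inference side concrete, here is a minimal sketch of a read-only LRU cache sitting in front of a slower backing table. It is a toy illustration only, not HugeCTR's GPU cache: the class `LRUEmbeddingCache`, the dict-based `cpu_table`, and the hit/miss counters are all hypothetical names invented for this example, and an `OrderedDict` stands in for GPU memory.

```python
from collections import OrderedDict


class LRUEmbeddingCache:
    """Toy read-only LRU cache in front of a slower 'CPU' embedding table."""

    def __init__(self, cpu_table, capacity):
        self.cpu_table = cpu_table   # hypothetical backing store: id -> vector
        self.capacity = capacity     # max number of rows held in the cache
        self.cache = OrderedDict()   # stands in for GPU memory
        self.hits = 0
        self.misses = 0

    def lookup(self, key):
        if key in self.cache:
            # Hit: mark the row as most recently used.
            self.cache.move_to_end(key)
            self.hits += 1
            return self.cache[key]
        # Miss: fetch from the backing table and insert into the cache.
        self.misses += 1
        value = self.cpu_table[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            # Evict the least recently used row.
            self.cache.popitem(last=False)
        return value


cpu_table = {i: [float(i)] * 4 for i in range(10)}
cache = LRUEmbeddingCache(cpu_table, capacity=2)
for key in [0, 1, 0, 2, 1]:
    cache.lookup(key)
print(cache.hits, cache.misses, list(cache.cache))
```

Because inference never writes to the embeddings, eviction here simply drops the row; nothing has to be written back.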
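In training the embeddings change constantly, so cache coherence is too complicated to implement: there is one copy on the CPU and one on the GPU, and every update has to be written through to the CPU. That generates a lot of traffic, and in the end the cache may well not reduce traffic at all.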
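With a static cache there are no multiple copies: the GPU's embeddings always stay on the GPU, and they are updated on the GPU as well. For multi-GPU, I think there should also be only one copy, with the GPU embedding table shared across GPUs. HugeCTR, however, implements a single-GPU cache.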
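A rough sketch of the static split, assuming each row is owned by exactly one side and is updated in place where it lives. Everything here is hypothetical and for illustration only (`StaticSplitEmbedding`, `hot_ids`, the `transfers` counter); it uses plain Python dicts in place of real device memory and is not how HugeCTR implements it.

```python
class StaticSplitEmbedding:
    """Toy static split: every row has exactly one owner (GPU or CPU)."""

    def __init__(self, num_rows, dim, hot_ids):
        hot = set(hot_ids)
        # Rows chosen up front as "hot" live on the GPU for the whole run.
        self.gpu_rows = {i: [0.0] * dim for i in range(num_rows) if i in hot}
        # All remaining rows live on the CPU. No row exists in both places.
        self.cpu_rows = {i: [0.0] * dim for i in range(num_rows) if i not in hot}
        self.transfers = 0  # rows moved across the CPU<->GPU link

    def _owner(self, key):
        return self.gpu_rows if key in self.gpu_rows else self.cpu_rows

    def lookup(self, key):
        if key in self.cpu_rows:
            self.transfers += 1  # CPU-resident row is sent to the GPU
        return self._owner(key)[key]

    def update(self, key, grad, lr=0.1):
        rows = self._owner(key)
        if key in self.cpu_rows:
            self.transfers += 1  # gradient is sent back to the CPU owner
        # Update in place at the owner; there is no second copy to sync.
        rows[key] = [w - lr * g for w, g in zip(rows[key], grad)]


table = StaticSplitEmbedding(num_rows=4, dim=2, hot_ids=[0, 1])
for key in [0, 0, 1, 3]:
    table.lookup(key)
    table.update(key, grad=[1.0, 1.0])
print(table.transfers)
```

In this run only row 3 crosses the link (once for the lookup, once for the update), while the three updates to GPU-resident rows cost no transfer. Under a write-through scheme, all four updates would have to reach the CPU copy.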
https://arxiv.org/pdf/2210.08803.pdf