RecSys '22 Merlin HugeCTR: GPU-accelerated Recommender System Training and Inference

Training uses a static cache, which is really just a split of the embeddings. Inference uses a dynamic cache (LRU).
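To make the contrast concrete, here is a minimal sketch of the two policies. It is not HugeCTR's implementation: `static_split`, `LRUCache`, and the assumption that the split is chosen by access frequency are all hypothetical and only illustrate the idea.

```python
from collections import OrderedDict


def static_split(ids_by_frequency, gpu_capacity):
    """Partition IDs once: the hottest go to the GPU, the rest stay on the CPU.

    Assumes (hypothetically) that IDs arrive sorted from most to least frequent.
    The partition never changes afterwards, so every ID has exactly one home.
    """
    gpu_ids = set(ids_by_frequency[:gpu_capacity])
    cpu_ids = set(ids_by_frequency[gpu_capacity:])
    return gpu_ids, cpu_ids


class LRUCache:
    """Dynamic cache: membership changes as lookups arrive."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()  # id -> embedding row, least recently used first

    def lookup(self, key, fetch_from_backing_store):
        """Return (row, hit). On a miss, fetch the row and possibly evict one."""
        if key in self.rows:
            self.rows.move_to_end(key)  # mark as most recently used
            return self.rows[key], True
        row = fetch_from_backing_store(key)
        self.rows[key] = row
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # evict the least recently used row
        return row, False
```

The static split decides placement once, so there is nothing to keep coherent. The LRU cache holds a second copy of whatever it caches, which is fine for inference where the rows are read-only.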

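During training the embeddings change constantly, so implementing cache coherence is too complex. With one copy on the CPU and another on the GPU, every update has to be written through to the CPU, which generates a lot of traffic, and in the end the cache may well not reduce traffic at all.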
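A back-of-the-envelope model of that argument, as a sketch only: it assumes each access touches one embedding row, that a row transfer in either direction costs one unit, and that every accessed row is updated. The function names are hypothetical and the model ignores any extra coherence messages.

```python
def transfers_without_cache(num_accesses):
    """Embeddings live on the CPU: read the row, then write the updated row back."""
    return 2 * num_accesses


def transfers_with_write_through(num_accesses, num_hits):
    """GPU cache with write-through: hits skip the read, but never the write."""
    misses = num_accesses - num_hits
    reads = misses          # only misses fetch the row from the CPU
    writes = num_accesses   # every update is written through to the CPU copy
    return reads + writes
```

Under these assumptions even a perfect hit rate only removes the read half of the traffic, and a low hit rate removes almost nothing.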

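With a static cache there are no multiple copies: the embeddings on the GPU stay on the GPU, and they are updated on the GPU too. In the multi-GPU case I would expect there to still be only one copy, with the GPU embedding table shared across GPUs. HugeCTR, however, implements a single-GPU cache.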
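The single-copy property can be sketched as follows. This is an illustration under assumptions, not HugeCTR code: `SingleCopyTable` is a hypothetical class, plain dicts stand in for GPU and CPU memory, and the update is plain SGD.

```python
class SingleCopyTable:
    """Each embedding row has exactly one home, so an update touches one place."""

    def __init__(self, gpu_ids, dim):
        self.dim = dim
        self.gpu_ids = set(gpu_ids)  # fixed at construction: the static split
        self.gpu_rows = {i: [0.0] * dim for i in self.gpu_ids}
        self.cpu_rows = {}  # rows outside the split, created lazily

    def _home(self, key):
        # The home of a row never changes, so there is no copy to invalidate.
        return self.gpu_rows if key in self.gpu_ids else self.cpu_rows

    def update(self, key, grad, lr=0.1):
        """Apply an SGD step in place, wherever the row lives."""
        row = self._home(key).setdefault(key, [0.0] * self.dim)
        for j, g in enumerate(grad):
            row[j] -= lr * g
```

Because a GPU-resident row is read and written only on the GPU, no write-through to the CPU is needed for it.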

jasperzhong / read-papers-and-code