EDIT: I reread your question more carefully. No, we don't have any examples of sharding across both host memory and device memory, although the library should support this. Can you elaborate on your use case?
We do support host-memory placement of embedding tables with a GPU LRU/LFU cache. I will work with @colin2328 to create some examples.
There are two ways to enable this: first, via a pinned-memory API, and second, via a sharding approach. The embedding table sharding approach won't be efficient, since you need a preprocessing step to identify which features (or keys) are needed, and the maximum number of features you can enable is bounded by GPU memory. The alternative to sharding is the pinned or zero-copy memory approach (not unified memory), but this requires either custom TorchScript ops or modifications to PyTorch (PyTorch does not natively expose zero-copy access to pinned host memory from GPU kernels).
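For concreteness, here is a minimal sketch of the host-resident-table idea using only standard PyTorch: the full table lives in pinned host memory and only the rows needed for a batch are staged to the GPU. Note this is explicit staging rather than true zero-copy access from GPU kernels, which, as noted above, would require custom kernels; all sizes are illustrative.

```python
import torch

# Illustrative sizes only.
num_embeddings, embedding_dim = 1_000_000, 128

# Keep the full table in pinned (page-locked) host memory so host-to-device
# copies can be issued asynchronously.
table = torch.empty(num_embeddings, embedding_dim, pin_memory=True)

def lookup(ids: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    # Gather only the rows needed for this batch on the host, stage them in a
    # pinned buffer, and copy that small slice to the GPU without blocking.
    rows = table.index_select(0, ids.cpu()).pin_memory()
    return rows.to(device, non_blocking=True)

ids = torch.randint(0, num_embeddings, (4096,))
emb = lookup(ids)  # (4096, 128) tensor on the GPU
```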
From what @xing-liu described it seems like the sharding approach is used. Is that the right understanding?
Commenting on the use case: let's take the full-size embedding table training described in the DLRM example: https://github.com/pytorch/torchrec/tree/main/examples/dlrm
Clearly, if I want to run this, I will need a large number of GPUs (80 is too high in my opinion; maybe somewhere around forty 40GB A100s, assuming the ADA optimizer). However, getting access to that many resources as a researcher is a nightmare :) So the use case is simple: as a researcher who does not have access to such a large cluster, how can I enable single-GPU inference/training using host memory?
(also - can this be labelled as an enhancement request?)
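To make the GPU-count estimate above concrete, a back-of-envelope calculation (all numbers are illustrative assumptions, not the actual DLRM/Criteo configuration):

```python
# Rough embedding-memory estimate under assumed sizes.
rows = 1_600_000_000        # total embedding rows across all tables (assumed)
dim = 128                   # embedding dimension (assumed)
bytes_per_elem = 4          # fp32 weights
optimizer_multiplier = 2    # weights + one fp32 optimizer state per element (assumed)

total_gb = rows * dim * bytes_per_elem * optimizer_multiplier / 1e9
gpus_needed = total_gb / 40  # 40 GB of HBM per A100
print(f"~{total_gb:.0f} GB of embedding state -> ~{gpus_needed:.0f} x 40GB A100s")
# ~1638 GB -> ~41 GPUs under these assumptions; a single host's DRAM can hold
# this instead, which is the point of the host-memory / UVM path.
```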
It is actually using CUDA UVM, and we have developed custom GPU embedding lookup kernels in FBGEMM. The kernels also include a software cache to cache popular embedding rows.
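For reference, a rough sketch of how those FBGEMM batched embedding kernels are typically driven; the module path and enum names are taken from fbgemm_gpu as of roughly this era and may differ across versions, and the table size is illustrative:

```python
import torch
# Module path may differ across fbgemm_gpu versions.
from fbgemm_gpu.split_table_batched_embeddings_ops import (
    SplitTableBatchedEmbeddingBagsCodegen,
    EmbeddingLocation,   # DEVICE / MANAGED (UVM) / MANAGED_CACHING (UVM + GPU cache)
    ComputeDevice,
)

# One 100M-row table kept in UVM (host memory) with a GPU software cache in
# front of it; assumes a CUDA device is available and current.
emb = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        (100_000_000, 128, EmbeddingLocation.MANAGED_CACHING, ComputeDevice.CUDA),
    ],
)

# Lookups use the usual (indices, offsets) CSR-style inputs; here each of the
# 4096 bags contains a single id.
indices = torch.randint(0, 100_000_000, (4096,), dtype=torch.int64, device="cuda")
offsets = torch.arange(0, 4097, dtype=torch.int64, device="cuda")
out = emb(indices, offsets)   # (4096, 128) pooled embeddings on the GPU
```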
@msharmavikram - please see https://github.com/pytorch/torchrec/pull/156/files cc @jasonjk-park
@msharmavikram an example of using 1 GPU with UVM, and with UVM caching, has now landed: https://github.com/pytorch/torchrec/blob/main/examples/sharding/uvm.ipynb
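For anyone skimming this thread, a rough sketch of steering the TorchRec planner toward the UVM-caching compute kernel for a specific table; the constraint fields and kernel name are assumptions that may differ between TorchRec versions, so treat the linked notebook as the authoritative example:

```python
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.embedding_types import EmbeddingComputeKernel

constraints = {
    # Keep "large_table" (hypothetical table name) in host memory behind the
    # GPU software cache; the enum member name may differ by TorchRec version.
    "large_table": ParameterConstraints(
        compute_kernels=[EmbeddingComputeKernel.FUSED_UVM_CACHING.value],
    ),
}

planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=1, compute_device="cuda"),
    constraints=constraints,
)
# plan = planner.plan(model, [sharder])  # then pass `plan` to DistributedModelParallel
```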
Hopefully this resolves your question. Feel free to create a new issue if not
Hi there, I am wondering if torchrec provides any support or features for enabling recsys models with embedding tables larger than a single GPU's memory (not multi-GPU, but using host memory). I looked at the documents but could not find any material on this aspect.
If this is not supported, what is the plan for supporting this feature?