pytorch / torchrec

PyTorch domain library for recommendation systems
https://pytorch.org/torchrec/
BSD 3-Clause "New" or "Revised" License

Query: single gpu support for very large embedding table #110

Closed msharmavikram closed 2 years ago

msharmavikram commented 2 years ago

Hi there, I am wondering if torchrec provides any support or features for enabling recsys with embedding table size beyond a single GPU memory (not doing multi-GPU but using host-memory). I looked at the documents but could not find any material on this aspect.

If this is not supported, what is the plan for supporting this feature?

colin2328 commented 2 years ago

EDIT: I reread your question more carefully. No, we don't have any examples of sharding across both host memory and device memory, although the library should support this. Can you elaborate on your use case?

xing-liu commented 2 years ago

We do support host-memory placement of embedding tables with GPU LRU/LFU cache. Will work with @colin2328 to create some examples.

msharmavikram commented 2 years ago

There are two ways to enable this: first, via a pinned-memory API, and second, via a sharding approach. The embedding-table sharding approach won't be efficient, because you need a preprocessing step to identify which features (or keys) are needed, and the maximum number of features you can enable is bounded by GPU memory. The alternative to sharding is the pinned-memory or zero-copy approach (not unified memory). But this will require either a custom TorchScript op or modifications to PyTorch (PyTorch does not natively support embedding lookups out of pinned host memory).

From what @xing-liu described, it seems like the sharding approach is used. Is that the right understanding?
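To make the pinned-memory idea concrete, here is a minimal sketch of keeping the full table in page-locked host memory and copying only the rows a batch needs onto the GPU. The `HostEmbedding` class and its `lookup` method are hypothetical names for illustration, not a torchrec or PyTorch API; on a CPU-only build it falls back to pageable memory.

```python
import torch

class HostEmbedding:
    """Illustrative sketch: full table in (pinned) host memory,
    per-batch rows gathered and copied to the device."""

    def __init__(self, num_rows: int, dim: int):
        use_cuda = torch.cuda.is_available()
        # Pinned (page-locked) memory enables fast async H2D copies;
        # fall back to ordinary pageable memory without CUDA.
        self.weight = torch.empty(num_rows, dim, pin_memory=use_cuda)
        torch.nn.init.normal_(self.weight, std=0.01)
        self.device = torch.device("cuda" if use_cuda else "cpu")

    def lookup(self, ids: torch.Tensor) -> torch.Tensor:
        # Gather only the requested rows on the host, then move that
        # small slice to the device (non_blocking helps when pinned).
        rows = self.weight[ids]
        return rows.to(self.device, non_blocking=True)

emb = HostEmbedding(num_rows=100_000, dim=16)
out = emb.lookup(torch.tensor([0, 5, 42]))
print(out.shape)  # torch.Size([3, 16])
```

The table size here is bounded by host RAM rather than GPU memory, at the cost of a host-to-device copy per batch.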

msharmavikram commented 2 years ago

Commenting on use case: let's take the full-size embedding table training described in DLRM example: https://github.com/pytorch/torchrec/tree/main/examples/dlrm

Clearly, if I want to run this, I will need a large number of GPUs (80 is too high in my opinion; maybe somewhere around 40 40GB A100s, assuming the ADA optimizer). However, getting access to that many resources as a researcher is a nightmare :) So the use case is simple: as a researcher who does not have access to such a large cluster, how can we enable single-GPU inference/training using host memory?

(also - can this be labelled as an enhancement request?)

xing-liu commented 2 years ago

It is actually using CUDA UVM, and we have developed custom GPU embedding lookup kernels in FBGEMM. The kernels also include a software cache that keeps popular embedding rows on the device.
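The caching idea can be sketched in plain Python: hot rows live in a small fixed-capacity device-side cache with LRU eviction, while cold rows stay in the large UVM/host-resident table. FBGEMM implements this inside fused CUDA kernels; the `CachedEmbedding` class below is purely illustrative and not the FBGEMM API.

```python
from collections import OrderedDict
import torch

class CachedEmbedding:
    """Illustrative LRU software cache over a large embedding table."""

    def __init__(self, num_rows: int, dim: int, cache_rows: int):
        self.table = torch.randn(num_rows, dim)  # stands in for UVM/host memory
        self.cache_rows = cache_rows             # cache capacity in rows
        self.cache = OrderedDict()               # row id -> row, in LRU order

    def lookup(self, ids: torch.Tensor) -> torch.Tensor:
        out = []
        for i in ids.tolist():
            if i in self.cache:
                self.cache.move_to_end(i)        # cache hit: mark recently used
            else:
                if len(self.cache) >= self.cache_rows:
                    self.cache.popitem(last=False)  # evict least recently used
                self.cache[i] = self.table[i]    # fetch cold row from the table
            out.append(self.cache[i])
        return torch.stack(out)

emb = CachedEmbedding(num_rows=1000, dim=8, cache_rows=4)
v = emb.lookup(torch.tensor([1, 2, 1, 3]))
print(v.shape)          # torch.Size([4, 8])
print(list(emb.cache))  # [2, 1, 3]
```

Because access to embedding rows is typically highly skewed, a small cache like this absorbs most lookups, and only misses pay the UVM/host-memory latency.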

colin2328 commented 2 years ago

@msharmavikram - please see https://github.com/pytorch/torchrec/pull/156/files cc @jasonjk-park

colin2328 commented 2 years ago

@msharmavikram an example of using a single GPU with UVM, and with UVM caching, has now landed: https://github.com/pytorch/torchrec/blob/main/examples/sharding/uvm.ipynb

Hopefully this resolves your question. Feel free to create a new issue if not.