TorchRec already supports UVM. We will publish some examples very soon.
Great!
Does it support hints? I searched for cudaMemAdvise and cudaMemPrefetchAsync but didn't find anything. Using UVM without hints can be very slow, sometimes even slower than reading everything from the CPU directly.
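For concreteness, the kind of hint I have in mind looks roughly like this (a minimal ctypes sketch against the CUDA runtime, not anything from TorchRec; the `prefetch_to_gpu` name is mine and I'm assuming `libcudart.so` is on the loader path):

```python
import ctypes

# Load the CUDA runtime; assumes libcudart.so is findable by the loader.
cudart = ctypes.CDLL("libcudart.so")

def prefetch_to_gpu(ptr: int, nbytes: int, device: int = 0) -> None:
    """Hint the driver to migrate a managed (UVM) range to `device`
    ahead of use, instead of paying page faults on first touch.
    Signature: cudaMemPrefetchAsync(devPtr, count, dstDevice, stream)."""
    err = cudart.cudaMemPrefetchAsync(
        ctypes.c_void_p(ptr), ctypes.c_size_t(nbytes),
        ctypes.c_int(device), None)  # None -> default stream
    assert err == 0, f"cudaMemPrefetchAsync failed with error {err}"
```

Without a prefetch or advise like this, every cold access is serviced by demand paging, which is where the slowdown comes from.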
For comparison, TorchRec currently trains the MLPerf DLRM in 12m43s on a single DGX A100 per https://github.com/pytorch/torchrec/tree/main/examples/dlrm#preliminary-training-results, compared to 28 min on a single A100 with the right hints.
We have implemented a software-managed cache (LRU and LFU) in CUDA and don't rely on hints for good performance.
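The idea, very roughly (a simplified Python sketch of an LRU row cache, not the actual CUDA implementation; all names here are made up):

```python
from collections import OrderedDict
import torch

class LRURowCache:
    """Toy software-managed cache: hot embedding rows live in an HBM
    buffer, the full table stays in host memory."""
    def __init__(self, cpu_table: torch.Tensor, capacity: int):
        self.cpu_table = cpu_table                        # full table, host memory
        self.hbm = torch.empty(capacity, cpu_table.shape[1], device="cuda")
        self.slot_of = OrderedDict()                      # row id -> HBM slot, LRU order
        self.free_slots = list(range(capacity))

    def lookup(self, row: int) -> torch.Tensor:
        if row in self.slot_of:                           # hit: refresh recency
            self.slot_of.move_to_end(row)
            return self.hbm[self.slot_of[row]]
        if not self.free_slots:                           # miss + full: evict LRU row
            victim, slot = self.slot_of.popitem(last=False)
            self.cpu_table[victim].copy_(self.hbm[slot])  # write back if rows mutate
            self.free_slots.append(slot)
        slot = self.free_slots.pop()                      # fill a slot from the host
        self.hbm[slot].copy_(self.cpu_table[row], non_blocking=True)
        self.slot_of[row] = slot
        return self.hbm[slot]
```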
I'll wait for the examples and benchmark numbers then. The numbers in the previous arXiv paper (probably out of date) weren't great.
@skyw -> btw, check out the fbgemm repo (https://github.com/pytorch/FBGEMM), which is where the CUDA kernels powering TorchRec live
@skyw : please check out https://github.com/pytorch/torchrec/pull/156/files cc @jasonjk-park
Thanks Colin, looking. BTW, do you have performance numbers, like DLRM on a single GPU or fewer than 8 GPUs?
@skyw FYI, an example of UVM and UVM caching using a single GPU (an A100 in this case) is available here: https://github.com/pytorch/torchrec/blob/main/examples/sharding/uvm.ipynb
Our current performance numbers for DLRM are here: https://github.com/pytorch/torchrec/tree/main/examples/dlrm We do train the MLPerf DLRM on 8 A100s. We don't train on a single A100, although this should be easily runnable by passing a single process. We can try running this and add the results to the table. cc @s4ayub
@skyw We've updated our perf numbers to show 8-GPU, 4-GPU, and 1-GPU setups: https://github.com/facebookresearch/dlrm/tree/main/torchrec_dlrm/#preliminary-training-results
Thanks. MLPerf DLRM in 1h35m29s on a single A100 does look like OK performance: about 3x slower than a heavily optimized version, similar to the ratio of 8-GPU TorchRec performance to NVIDIA's MLPerf PyTorch submission.
The slower perf is mainly because 1) TorchRec DLRM uses fp32 MLPs while NV uses AMP, and 2) NV's embedding kernel is optimized for one-hot encoded features.
The feature, motivation and pitch
Embedding tables can be very large and require many GPUs just to store the embeddings. With high-speed interconnect this is still the fastest approach, but the cost is high. Offloading large embeddings to the CPU is a cost-effective way to run large models. In addition, we can take advantage of the long-tail distribution of the input data: only infrequently accessed embeddings are offloaded to the CPU, minimizing the performance impact. UVM provides all the necessary functionality. With application-managed access counts of embedding rows and UVM hints (cuMemAdvise), it can achieve very good performance. We are able to train MLPerf DLRM in 28 minutes on a single A100, compared to 4 minutes on a DGX A100 (8x A100) machine.
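As a back-of-the-envelope illustration of the long-tail argument (the distribution, row size, and HBM budget below are made-up numbers for the sketch, not measurements from DLRM):

```python
import numpy as np

rng = np.random.default_rng(0)
access_counts = rng.zipf(1.2, size=1_000_000)   # synthetic long-tail row popularity
row_bytes = 128 * 4                             # hypothetical 128-dim fp32 rows
hbm_budget = 64 * 2**20                         # pretend 64 MiB of HBM is spare

order = np.argsort(-access_counts)              # hottest rows first
n_hot = hbm_budget // row_bytes
hot, cold = order[:n_hot], order[n_hot:]        # cold tail stays in CPU memory
coverage = access_counts[hot].sum() / access_counts.sum()
print(f"{len(hot):,} rows kept in HBM cover {coverage:.1%} of all lookups")
```

A small hot set covers the vast majority of lookups, which is why CPU offload of the tail costs so little.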
What will be in the PR if accepted
UVM tensor
A UVM tensor that uses cudaMallocManaged() to allocate memory. It operates like a normal GPU tensor from PyTorch's point of view.
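Conceptually, the allocation looks like this (a hedged ctypes sketch; the real PR would need C++ glue to expose the pointer as an actual torch.Tensor, which is assumed away here):

```python
import ctypes
import numpy as np

cudart = ctypes.CDLL("libcudart.so")  # assumes the CUDA runtime is on the loader path

def malloc_managed(nbytes: int) -> int:
    """cudaMallocManaged: one pointer valid on both host and device."""
    ptr = ctypes.c_void_p()
    err = cudart.cudaMallocManaged(ctypes.byref(ptr),
                                   ctypes.c_size_t(nbytes),
                                   ctypes.c_uint(1))  # 1 = cudaMemAttachGlobal
    assert err == 0, f"cudaMallocManaged failed with error {err}"
    return ptr.value

rows, dim = 1024, 128
ptr = malloc_managed(rows * dim * 4)
# The same pointer is directly addressable from the CPU...
host_view = np.ctypeslib.as_array(
    ctypes.cast(ptr, ctypes.POINTER(ctypes.c_float)), shape=(rows, dim))
host_view[:] = 0.0  # ...and from GPU kernels, with the driver paging on demand.
```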
UVM hints
Python wrappers over cudaMemPrefetchAsync() and cudaMemAdvise() to set hints on the underlying memory of a UVM tensor, plus the primary PyTorch function definitions.
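For the cudaMemAdvise() wrapper, something along these lines (sketch only; the `advise` name is hypothetical, and the enum values are taken from the CUDA runtime headers):

```python
import ctypes

cudart = ctypes.CDLL("libcudart.so")

# cudaMemoryAdvise enum values from the CUDA runtime headers.
SET_READ_MOSTLY = 1          # cudaMemAdviseSetReadMostly
SET_PREFERRED_LOCATION = 3   # cudaMemAdviseSetPreferredLocation
SET_ACCESSED_BY = 5          # cudaMemAdviseSetAccessedBy
CPU_DEVICE_ID = -1           # cudaCpuDeviceId

def advise(ptr: int, nbytes: int, advice: int, device: int) -> None:
    """Thin wrapper over cudaMemAdvise(devPtr, count, advice, device)."""
    err = cudart.cudaMemAdvise(ctypes.c_void_p(ptr), ctypes.c_size_t(nbytes),
                               ctypes.c_int(advice), ctypes.c_int(device))
    assert err == 0, f"cudaMemAdvise failed with error {err}"

# E.g. pin a cold region to host memory while keeping it GPU-accessible:
# advise(ptr, nbytes, SET_PREFERRED_LOCATION, CPU_DEVICE_ID)
# advise(ptr, nbytes, SET_ACCESSED_BY, 0)
```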
Examples
MLPerf DLRM plus other variants.