TorchRec already supports UVM. We will publish some examples very soon.
Great!
Does it support hints? I searched for cudaMemAdvise and cudaMemPrefetchAsync but didn't find anything. Using UVM without hints can be very slow, sometimes even slower than reading everything from the CPU directly.
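For concreteness, the kind of hint I have in mind looks roughly like this (a minimal ctypes sketch against the CUDA runtime, not anything from TorchRec; the `prefetch_to_gpu` name is mine and I'm assuming `libcudart.so` is on the loader path):

```python
import ctypes

# Load the CUDA runtime; assumes libcudart.so is findable by the loader.
cudart = ctypes.CDLL("libcudart.so")

def prefetch_to_gpu(ptr: int, nbytes: int, device: int = 0) -> None:
    """Hint the driver to migrate a managed (UVM) range to `device`
    ahead of use, instead of paying page faults on first touch.
    Signature: cudaMemPrefetchAsync(devPtr, count, dstDevice, stream)."""
    err = cudart.cudaMemPrefetchAsync(
        ctypes.c_void_p(ptr), ctypes.c_size_t(nbytes),
        ctypes.c_int(device), None)  # None -> default stream
    assert err == 0, f"cudaMemPrefetchAsync failed with error {err}"
```

Without a prefetch or advise like this, every cold access is serviced by demand paging, which is where the slowdown comes from.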
For comparison, TorchRec currently trains the MLPerf DLRM in 12m43s on a single DGX A100 per https://github.com/pytorch/torchrec/tree/main/examples/dlrm#preliminary-training-results, compared to 28 min on a single A100 with the right hints.
We have implemented a software-managed cache (LRU and LFU) in CUDA and don't rely on hints for good performance.
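The idea, very roughly (a simplified Python sketch of an LRU row cache, not the actual CUDA implementation; all names here are made up):

```python
from collections import OrderedDict
import torch

class LRURowCache:
    """Toy software-managed cache: hot embedding rows live in an HBM
    buffer, the full table stays in host memory."""
    def __init__(self, cpu_table: torch.Tensor, capacity: int):
        self.cpu_table = cpu_table                        # full table, host memory
        self.hbm = torch.empty(capacity, cpu_table.shape[1], device="cuda")
        self.slot_of = OrderedDict()                      # row id -> HBM slot, LRU order
        self.free_slots = list(range(capacity))

    def lookup(self, row: int) -> torch.Tensor:
        if row in self.slot_of:                           # hit: refresh recency
            self.slot_of.move_to_end(row)
            return self.hbm[self.slot_of[row]]
        if not self.free_slots:                           # miss + full: evict LRU row
            victim, slot = self.slot_of.popitem(last=False)
            self.cpu_table[victim].copy_(self.hbm[slot])  # write back if rows mutate
            self.free_slots.append(slot)
        slot = self.free_slots.pop()                      # fill a slot from the host
        self.hbm[slot].copy_(self.cpu_table[row], non_blocking=True)
        self.slot_of[row] = slot
        return self.hbm[slot]
```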
I'll wait for the examples and benchmark numbers then. The numbers in the previous arXiv paper (probably out of date) weren't great.
@skyw -> btw, check out the fbgemm repo (https://github.com/pytorch/FBGEMM), which is where the CUDA kernels powering TorchRec live
@skyw : please check out https://github.com/pytorch/torchrec/pull/156/files cc @jasonjk-park
Thanks Colin, looking. BTW, do you have performance numbers, like DLRM on a single GPU or fewer than 8 GPUs?
@skyw FYI, an example of UVM and UVM caching using a single GPU (an A100 in this case) is available here: https://github.com/pytorch/torchrec/blob/main/examples/sharding/uvm.ipynb
Our current performance numbers for DLRM are here: https://github.com/pytorch/torchrec/tree/main/examples/dlrm We do train the MLPerf DLRM on 8 A100s. We don't train on a single A100, although this should be easily runnable by passing a single process. We can try running this and add the results to the table. cc @s4ayub
@skyw We've updated our perf numbers to show 8-GPU, 4-GPU, and 1-GPU setups: https://github.com/facebookresearch/dlrm/tree/main/torchrec_dlrm/#preliminary-training-results
Thanks. MLPerf DLRM in 1h35m29s on a single A100 does look like OK performance: about 3x slower than a heavily optimized version, similar to the ratio of 8-GPU TorchRec performance to NVIDIA's MLPerf PyTorch submission.
The slower perf is mainly because 1) TorchRec DLRM uses fp32 MLPs while NV uses AMP, and 2) NV's embedding kernel is optimized for one-hot encoded features.
The feature, motivation and pitch
Embedding tables can be very large and require many GPUs just to store the embeddings. With high-speed interconnect this is still the fastest approach, but the cost is high. Offloading large embeddings to the CPU is a cost-effective way to run large models. In addition, we can take advantage of the long-tail distribution of the input data: only infrequently accessed embeddings are offloaded to the CPU, minimizing the performance impact. UVM provides all the necessary functionality. With application-managed access counts of embedding rows and UVM hints (cuMemAdvise), it can achieve very good performance. We are able to train MLPerf DLRM in 28 minutes on a single A100, compared to 4 minutes on a DGX A100 (8x A100) machine.
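As a back-of-the-envelope illustration of the long-tail argument (the distribution, row size, and HBM budget below are made-up numbers for the sketch, not measurements from DLRM):

```python
import numpy as np

rng = np.random.default_rng(0)
access_counts = rng.zipf(1.2, size=1_000_000)   # synthetic long-tail row popularity
row_bytes = 128 * 4                             # hypothetical 128-dim fp32 rows
hbm_budget = 64 * 2**20                         # pretend 64 MiB of HBM is spare

order = np.argsort(-access_counts)              # hottest rows first
n_hot = hbm_budget // row_bytes
hot, cold = order[:n_hot], order[n_hot:]        # cold tail stays in CPU memory
coverage = access_counts[hot].sum() / access_counts.sum()
print(f"{len(hot):,} rows kept in HBM cover {coverage:.1%} of all lookups")
```

A small hot set covers the vast majority of lookups, which is why CPU offload of the tail costs so little.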
What will be in the PR if accepted
UVM tensor
A UVM tensor that uses cudaMallocManaged() to allocate memory. It operates like a normal GPU tensor from PyTorch's point of view.
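Conceptually, the allocation looks like this (a hedged ctypes sketch; the real PR would need C++ glue to expose the pointer as an actual torch.Tensor, which is assumed away here):

```python
import ctypes
import numpy as np

cudart = ctypes.CDLL("libcudart.so")  # assumes the CUDA runtime is on the loader path

def malloc_managed(nbytes: int) -> int:
    """cudaMallocManaged: one pointer valid on both host and device."""
    ptr = ctypes.c_void_p()
    err = cudart.cudaMallocManaged(ctypes.byref(ptr),
                                   ctypes.c_size_t(nbytes),
                                   ctypes.c_uint(1))  # 1 = cudaMemAttachGlobal
    assert err == 0, f"cudaMallocManaged failed with error {err}"
    return ptr.value

rows, dim = 1024, 128
ptr = malloc_managed(rows * dim * 4)
# The same pointer is directly addressable from the CPU...
host_view = np.ctypeslib.as_array(
    ctypes.cast(ptr, ctypes.POINTER(ctypes.c_float)), shape=(rows, dim))
host_view[:] = 0.0  # ...and from GPU kernels, with the driver paging on demand.
```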
UVM hints
Python wrappers over cudaMemPrefetchAsync() and cudaMemAdvise() to set hints on the underlying memory of a UVM tensor, plus the primary PyTorch function definitions.
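For the cudaMemAdvise() wrapper, something along these lines (sketch only; the `advise` name is hypothetical, and the enum values are taken from the CUDA runtime headers):

```python
import ctypes

cudart = ctypes.CDLL("libcudart.so")

# cudaMemoryAdvise enum values from the CUDA runtime headers.
SET_READ_MOSTLY = 1          # cudaMemAdviseSetReadMostly
SET_PREFERRED_LOCATION = 3   # cudaMemAdviseSetPreferredLocation
SET_ACCESSED_BY = 5          # cudaMemAdviseSetAccessedBy
CPU_DEVICE_ID = -1           # cudaCpuDeviceId

def advise(ptr: int, nbytes: int, advice: int, device: int) -> None:
    """Thin wrapper over cudaMemAdvise(devPtr, count, advice, device)."""
    err = cudart.cudaMemAdvise(ctypes.c_void_p(ptr), ctypes.c_size_t(nbytes),
                               ctypes.c_int(advice), ctypes.c_int(device))
    assert err == 0, f"cudaMemAdvise failed with error {err}"

# E.g. pin a cold region to host memory while keeping it GPU-accessible:
# advise(ptr, nbytes, SET_PREFERRED_LOCATION, CPU_DEVICE_ID)
# advise(ptr, nbytes, SET_ACCESSED_BY, 0)
```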
Examples
MLPerf DLRM plus other variants.