leela-zero

Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper.
GNU General Public License v3.0

NVIDIA new deep learning software releases #2007

Open alreadydone opened 5 years ago

alreadydone commented 5 years ago

From NVIDIA's newsletter today, there appear to be a lot of improvements for the Turing architecture (in particular for the GeForce RTX 20 series). I don't know whether these improvements have made cuDNN faster than OpenCL at lower batch sizes, but I want to point out that the infrastructure to support huge batch sizes is ready.

New Deep Learning Software Releases

cuDNN 7.4 offers new features and performance improvements to deliver faster training.

CUTLASS 1.2, the latest version of the CUDA template library for linear algebra subroutines, includes several key updates.

TensorRT 5, the latest version of NVIDIA's optimizer and runtime, provides new optimizations, APIs, and support for the new Turing architecture.

It appears that cuDNN was released 1-2 days ago, CUTLASS was released 2 weeks ago, and TensorRT was released 1 week ago.


See also https://github.com/Chicoryn/dream-go/issues/35, Use cuTLASS instead of cuDNN for convolutions:

I haven't checked @Ttl's code to see what's used in his cuDNN branches, but @Chicoryn says that beyond a lot of boilerplate code necessary to initialise tensor descriptors, etc., they only use a handful of functions from cuDNN and cuBLAS.

wonderingabout commented 5 years ago

Interesting, but I read that many of the new features only apply to RTX 2xxx+ cards, which are not the majority of personal GPUs at the moment.

Still, it would be interesting to implement this.

On Microsoft Azure, Microsoft provides an all-in-one deep learning custom image that includes NVIDIA drivers, cuDNN, etc., which may interest you:

https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.linux-data-science-vm?tab=Overview

ihavnoid commented 5 years ago

All these require running CUDA. The first thing we would have to do is rewrite the OpenCL backend in CUDA, and then probably use the SGEMM routines from CUTLASS and others. Quite a lot of work, but it would be interesting if somebody can pull it off.

Problem is that it will benefit all the RTX 2xxx GPUs and probably the Titan V / Tesla V100 GPUs, but won't benefit anything older.

Ttl commented 5 years ago

I have actually been working a bit on porting the current OpenCL backend to run on CUDA, but haven't gotten very far yet. The benefit of using CUDA is that the general CLBlast SGEMM can be replaced with the hand-optimized SGEMM from cuBLAS, which is a lot faster. The same OpenCL transform kernels can be compiled for CUDA with the help of a special include file that replaces OpenCL-specific terminology with CUDA terms. It should be a little bit faster on all NVIDIA GPUs and a lot faster on the new GPUs that support tensor operations.
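
A minimal sketch of what such an include file could look like (an illustration of the technique, not the actual header from the cuda branch); with these macros in place, the unmodified OpenCL kernel source can be handed to nvcc:

```cpp
// opencl_to_cuda.h -- hypothetical compatibility shim mapping OpenCL
// kernel syntax onto CUDA so nvcc can compile the same source.
#define __kernel extern "C" __global__
#define __global            // global pointers are plain device pointers in CUDA
#define __local __shared__  // works for in-kernel arrays; pointer-to-local
                            // kernel arguments would need manual handling
#define __constant __constant__

// OpenCL work-item indexing mapped onto the CUDA grid/block hierarchy.
// The dimension argument is always a literal, so the ternaries compile away.
#define get_global_id(d)  ((d) == 0 ? blockIdx.x * blockDim.x + threadIdx.x \
                         : (d) == 1 ? blockIdx.y * blockDim.y + threadIdx.y \
                                    : blockIdx.z * blockDim.z + threadIdx.z)
#define get_local_id(d)   ((d) == 0 ? threadIdx.x \
                         : (d) == 1 ? threadIdx.y : threadIdx.z)
#define get_group_id(d)   ((d) == 0 ? blockIdx.x \
                         : (d) == 1 ? blockIdx.y : blockIdx.z)

// Both OpenCL fence flags reduce to a block-wide sync in CUDA.
#define CLK_LOCAL_MEM_FENCE 0
#define CLK_GLOBAL_MEM_FENCE 0
#define barrier(flags) __syncthreads()
```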

wonderingabout commented 5 years ago

Problem is that it will benefit all the RTX 2xxx GPUs and probably the Titan V / Tesla V100 GPUs, but won't benefit anything older.

Tesla V100s could be available to a significant number of contributors through the cloud free trials from Google (400 hours of computing) and Microsoft Azure (250-300 hours of computing).

d7urban commented 5 years ago

Wasn’t there something about the license that makes it bad for the project?

ihavnoid commented 5 years ago

@Ttl - Great! I was trying to put some effort into doing it, but since you are working on it I will try an alternative approach: writing inline PTX assembly while still reusing the same OpenCL code. Gonna spend the weekend reading the PTX assembly manual. :)
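
To give a flavor of the idea: NVIDIA's OpenCL compiler accepts inline PTX through an asm() statement, much as CUDA does, so special instructions can be issued from an otherwise ordinary OpenCL kernel. A minimal hypothetical sketch (the fused multiply-add stands in for whatever Turing-specific instructions would actually be targeted; this will not compile on non-NVIDIA OpenCL implementations):

```c
// Illustration only: an OpenCL kernel issuing a PTX instruction directly.
__kernel void fma_ptx(__global const float* a, __global const float* b,
                      __global const float* c, __global float* out) {
    size_t i = get_global_id(0);
    float r;
    // fma.rn.f32: single-precision fused multiply-add, round-to-nearest-even.
    asm("fma.rn.f32 %0, %1, %2, %3;"
        : "=f"(r)
        : "f"(a[i]), "f"(b[i]), "f"(c[i]));
    out[i] = r;
}
```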

Ttl commented 5 years ago

Very messy code for the CUDA branch, which currently only works with single precision, is here: https://github.com/Ttl/leela-zero/tree/cuda

The performance turned out to be slightly disappointing. It's not even faster than OpenCL: at batch size 1, OpenCL gets 73 n/s and CUDA gets only 56 n/s. Unlike cudnn it does scale a little better at small batch sizes, but it can't exceed the OpenCL performance at any batch size. At a batch size of 5, OpenCL gets 107 n/s and CUDA gets 105 n/s. No idea why it's so slow; I would have expected the cublas sgemm to be much faster.

The good thing about this work is that LZ can now be used with NVIDIA's CUDA profiling tools. At higher batch sizes the out_in transformation takes 42% of the runtime. Optimizing it for higher batch sizes should get the OpenCL performance very close to cudnn. I already made some slight optimizations to it that increase nps by about 4% at a batch size of 5.

EDIT: Apparently cublas uses a 128x64-tile sgemm; when the second dimension is only 25, over half of the work is wasted. OpenCL can be tuned for a smaller tile size of 32, wasting much less work. That's why OpenCL is much faster at batch size 1.
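
To put rough numbers on that tile mismatch (assuming the 25 here is the dimension covered by the 64-wide side of the tile): a 64-wide tile computes 64 columns for 25 useful ones, so only 25/64 ≈ 39% of the work is useful, i.e. over 60% wasted; a 32-wide tile gives 25/32 ≈ 78% useful, roughly doubling the useful fraction.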

gcp commented 5 years ago

Wasn’t there something about the license that makes it bad for the project?

If you build a Leela Zero binary with CUDA support, it becomes illegal to redistribute. You can use it on your own machine, but we can't make releases with it. Unfortunately CUTLASS still relies on the CUDA SDK, so it doesn't avoid this problem. It does avoid the need to download cuDNN separately.

From a project perspective, requiring people to make their own build doesn't matter so much (many are already doing it), but it will surely lead to frustration for less knowledgeable users and (this I fear) to people making illegal binaries and posting them everywhere.

The situation with the RTX cards has made this a real issue, though. It seems that their fp16 capability isn't exposed to OpenCL, which means they lose half their performance. They also have dedicated neural network hardware, and that isn't exposed either. (Maybe surprisingly, the latter isn't actually quite as bad as the former from a performance perspective, due to obscure interactions between mathematical optimizations and limited hardware precision.)
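
The fp16 point is easy to check on any given card; a minimal sketch in plain OpenCL host code (standard API calls, error handling omitted) that asks the driver whether it reports the cl_khr_fp16 extension:

```c
#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    char ext[16384];
    /* First platform / first GPU is good enough for a quick check. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(ext), ext, NULL);
    printf("cl_khr_fp16 exposed: %s\n",
           strstr(ext, "cl_khr_fp16") ? "yes" : "no");
    return 0;
}
```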

As much as I'd like to say "fuck NVIDIA, let them make decent OpenCL drivers", the current GPU market situation isn't going to let that work out for the foreseeable future. And we'd be running at half the expected speed on the most popular cards. That's not good.

For making redistributable binaries, everyone who contributed code would need to agree to a license exemption for CUDA. I'm not sure that's going to be feasible. lc0 managed this, but they did it earlier, before they had many contributors. I did a quick check and it's probably 20-ish people who would all need to agree for LZ.

I was looking at ffmpeg, which had the same problem; there, someone basically clean-room implemented a dynamic loader for the CUDA DLLs and wrote the corresponding headers for it: https://github.com/FFmpeg/nv-codec-headers/tree/master/include/ffnvcodec Unfortunately it's probably not complete enough for us.
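
The gist of that ffmpeg approach, as a hypothetical sketch: declare the driver-API entry points yourself and resolve them from the user's installed driver at runtime, so no NVIDIA SDK code is linked into or shipped with the binary. The cuInit/cuDeviceGetCount names and signatures below are the standard CUDA driver API; everything else is illustrative:

```cpp
// Hypothetical sketch of an ffmpeg-style clean-room CUDA loader.
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
static void* load_lib() { return (void*)LoadLibraryA("nvcuda.dll"); }
static void* load_sym(void* h, const char* s) {
    return (void*)GetProcAddress((HMODULE)h, s);
}
#else
#include <dlfcn.h>
static void* load_lib() { return dlopen("libcuda.so.1", RTLD_LAZY); }
static void* load_sym(void* h, const char* s) { return dlsym(h, s); }
#endif

// Clean-room declarations mirroring the CUDA driver API -- no SDK headers.
typedef int CUresult;  // 0 == CUDA_SUCCESS
typedef CUresult (*cuInit_t)(unsigned int flags);
typedef CUresult (*cuDeviceGetCount_t)(int* count);

int main() {
    void* lib = load_lib();
    if (!lib) { std::puts("no NVIDIA driver found"); return 1; }
    auto cuInit = (cuInit_t)load_sym(lib, "cuInit");
    auto cuDeviceGetCount = (cuDeviceGetCount_t)load_sym(lib, "cuDeviceGetCount");
    int count = 0;
    if (cuInit && cuDeviceGetCount && cuInit(0) == 0)
        cuDeviceGetCount(&count);
    std::printf("CUDA devices: %d\n", count);
    return 0;
}
```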

gcp commented 5 years ago

writing inline PTX assembly while still reusing the same OpenCL code

Do you mean shoving the compiled program into the GPU over the OpenCL interface? Interesting approach if it works, and yes, it would avoid the licensing issues.
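
For reference, NVIDIA's OpenCL implementation accepts PTX through clCreateProgramWithBinary, so "shoving the compiled program in" could look roughly like this hypothetical helper (ctx and device are assumed to come from the usual platform/device setup):

```c
#include <string.h>
#include <CL/cl.h>

/* Sketch: load a pre-assembled PTX module over the OpenCL interface.
 * ptx_src is the PTX text itself, hand-written or produced by nvcc. */
cl_program load_ptx(cl_context ctx, cl_device_id device, const char* ptx_src) {
    size_t len = strlen(ptx_src);
    const unsigned char* bin = (const unsigned char*)ptx_src;
    cl_int binary_status, err;
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &len, &bin,
                                                &binary_status, &err);
    if (err == CL_SUCCESS)
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    return prog;
}
```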

alreadydone commented 5 years ago

FYI Lc0 actually managed to get permission from 27 contributors in 10 days (Jul 16-25): https://github.com/LeelaChessZero/lc0/issues/184#issuecomment-405394364.

A friend mentioned HIP to me, which aims to port CUDA apps to ROCm/MIOpen so they run on AMD cards, but I have no idea about the current state of development.

gcp commented 5 years ago

FYI Lc0 actually managed to get permission from 27 contributors in 10 days

Okay, maybe we should try as well.

gcp commented 5 years ago

A friend mentioned HIP to me, which aims to port CUDA apps to ROCm/MIOpen so they run on AMD cards, but I have no idea about the current state of development.

We don't have a need to run CUDA apps on AMD cards; the OpenCL backend is fine there - see Ttl's benchmarks. The problem is that NVIDIA's OpenCL support is far behind their CUDA support at this point, and the lack of fp16 support hurts us greatly.