Tesla V100 and CNTK poor performance

MMRohe commented 6 years ago

Hello,

I have tried to run a deep learning model with CNTK and an azure virtual machine that is equipped with tesla V100 GPU.

This GPU is supposed to be more powerful that the one I was training the model with before. Nevertheless I saw a decrease in the performance when training with CNTK:

Is it expected ?
Is it a problem of the Tesla architecture not yet being fully taken into account with CNTK ?
Is there a way to fix the problem ?

Thanks for your answer !

jaliyae commented 6 years ago

Assume you verified that the training was performed on the GPU in the VM instead of CPU. There could be many reasons, is the data coming from the local disk in the VM?, what type of GPUs are we comparing exactly? Are you running a benchmark or a custom test? Please let us know.

MMRohe commented 6 years ago

Hey jaliyae, thanks a lot for your answer. To answer your question:

I only monitor the time spent during training (ie. during the step trainer.train_minibatch(batch_inputs)) so any loading time of data which could influence the performance is not taken into account
I compared 4 things on the same code: the local CPU I have on my computer, the local GPU (GTX GeForce TI 1080), the CPU on the azure virtual machine and GPU on the azure virtual machine (Tesla V100)
I am running a custom test which just compute the time spent on the training per iteration on the exact same code. If you have another benchmark to propose that I could test I would be happy to do it.

Results are here, well performance is not catastrophic on Tesla V100 but I would have expected performance to be at least on par with 1080Ti if not better:

Local CPU (Intel Xeon 2145 3.70GHz) : 3.2 seconds per iteration
Local GPU (GeForce GTX 1080TI) : 0.036 seconds per iteration
Azure VM CPU (Intel Xeon E5-2690 2.60GHz) : 6.1 seconds per iteration
Azure VM GPU(Tesla V100): 0.043 seconds per iteration

FDecaYed commented 6 years ago

Which cntk binary are you using? what is the cuda/cudnn version? Are you doing (small) gemms in your benckmark? Math libraries could get slower when switching to new generation of card. It is because the new card does not exist yet when those libraries got release. So some optimization/heuristic may not be optimal compare to running on older card. There are optimizations getting into cublas with cuda releases and you could see better performance on volta when cntk move to higher version.

DHOFM commented 6 years ago

@MMRohe How did you get the V 100 in Azure - through the preview program ? I only see the P 100 and K 80 ( and the M Series). Do you have access of a "real" card ? In the K 80 setups in Azure the cheapest instance Nc6 is only a "half" card because of the dual GPU design. You can check through Nvidia-smi what you got... You will lose some time because of transferring your data to the card memory, because they are network attached. What makes the V 100 so "fast" for DL is the Tensor Core FLOPS - but I don´t know if CNTK supports this already... How much per Minute does MS charge for V 100 ? There is a big gap between K 80 and P 100 so it could be cheaper to let the old cards calculate ( even If it will take more time )...

MMRohe commented 6 years ago

@DHOFM yes I got the V100 through Azure preview program. I do not have access to a "real card" if by that you mean outside of a virtual machine. I think it charges something around 1.5$/hour for the V100.

After investigating a little bit the issue, it seems that it is indeed the tensor core flops that makes the V100 faster than another card. But this is only supported in cuda 9.0 which is not yet implemented in CNTK (prob in next CNTK release if I understood correctly). I have tried to compile it with the source code of github to try before the release but did not manage to get everything working yet. Will update the performance result when I have something running with a version compatible with 9.0

MMRohe commented 6 years ago

@FDecaYed Follow-up on the issue, I managed to get the last binaries of CNTK available on the github (well actually the version from 1 week ago which already supported float16/cuda9) working with Cuda 9.0. I ran the Resnet example of the image classification folder of CNTK. I get the following results:

GPU 1080Ti : ResNet training 8000 images/s. Tesla V100 and float32: ResNet training 7700 images/s. Tesla V100 and float16: ResNet training 7000 images/s.

So I think current CNTK implementation and Cuda 9.0 does not really support fully the new volta architecture. Is the last version of CNTK supposed to make use of the tensor cores already ?

Also float16 training is slower than float32 training which is quite unexpected

FDecaYed commented 6 years ago

@Rov67777 The latest should support volta tensor core already. I saw you have fairly fast imgs/sec, which benchmark are you running? For small networks, float 16 may not help a lot(the fact that v100 slower than 1080ti suggest you don't have enough work for GPU) and you may run into cpu/disk bottleneck already(in which case gpu performance doesn't matter at all). I would suggest you try Resnet 50 and above.

MMRohe commented 6 years ago

@FDecaYed thanks for your answer. Following your comment I tested the GPU on the resnet 110 traning example that can be found under: Examples\Image\Classification\ResNet\Python>python TrainResNet_CIFAR10.py -n resnet110

I use the very last version of CNTK with Cuda 9.0 and Tesla V100 and got the following performance:

float 32 (standard) training : 1600 images per second
float 16 training : 1500 images per second

So I still see decrease of performance when training with float 16. I cannot test this example and compare with GTX 1080 Ti as of now but I will probably try next week.

Is there any benchmark performance on some code that was run with CNTK ? So that I have some ground truth to compare.

FDecaYed commented 6 years ago

I see the problem, it is not the network that is too small. it is the cifar input, thus the actual convolution that are small. Cifar is a good introduction example to neural networks, but as a benchmark, the size(32x32) is not a good representation of works people do today. If you want to confirm fp16 works and see speed up, you can try larger images(like 224x224). If you do have workload that dealing with small images and networks, fp16 may not helps you a lot until it is further tuned to allow coarser grained operation and remove all the overhead as it is in preview stage now.

MMRohe commented 6 years ago

@FDecaYed well I have tried with one of my own network which performs convolution on large images and I get the same conclusion : no improvement of performance using float16 with respect to float32. I would be interested if somebody else manage to do some test and get similar results.

ke1337 commented 6 years ago

My test of ResNet50 runs at 760 images/s on V100 using fp16, while it's 220 images/s on P100 with fp32. Both tested using latest 2.4 release with CUDA 9 on single GPU.

FDecaYed commented 6 years ago

@Rov67777 The next step is to confirm whether algorithms using tensor cores is triggered for your convolution. Currently the easiest way to do that is nvprof. It would be helpful if you can run your test with nvprof and share the list of kernel called. If it is not triggered, please refer to cudnn doc to see the requirement. The most common reason is input/output feature map are required to be multiple of 8. In that case, you can just improve the feature map count a little to be multiple of 8.

microsoft / CNTK

Tesla V100 and CNTK poor performance #2848