pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

The inference speed of manually compiled torch is slower than torch built from the official binaries? #31004

Open PoonKinWang opened 4 years ago

PoonKinWang commented 4 years ago

I compiled torch (v1.4.0, cudatoolkit=10.2, cudnn=7.6.5) from source with `python setup.py install`. I then ran the shufflenet v2 0.5 model with the compiled library (batch size = 8), and the measured time was 0.0083. But when I run the same model with torch (v1.3.1, cudatoolkit=10.0, cudnn=7.6.4) installed from the official binaries through conda, the time is 0.0075. Why is the official build faster? My GPU is a 2080 Ti. Relevant config:

```
No OpenMP library needs to be linked against
-- Found CUDA: /usr/local/cuda-10.2 (found version "10.2")
-- Caffe2: CUDA detected: 10.2
-- Caffe2: CUDA nvcc is: /usr/local/cuda-10.2/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda-10.2
-- Caffe2: Header version is: 10.2
-- Found cuDNN: v7.6.5  (include: /usr/local/cuda-10.2/include, library: /usr/local/cuda-10.2/lib64/libcudnn.so)
-- Autodetected CUDA architecture(s):  7.5 7.5 7.5
-- Added CUDA NVCC flags for: -gencode;arch=compute_75,code=sm_75
-- Autodetected CUDA architecture(s):  7.5 7.5 7.5
-- Could NOT find CUB (missing: CUB_INCLUDE_DIR)
-- Found CUDA: /usr/local/cuda-10.2 (found suitable version "10.2", minimum required is "7.0")
-- CUDA detected: 10.2
-- Could NOT find NCCL (missing: NCCL_INCLUDE_DIR NCCL_LIBRARY)
```

cmakelog.log

cc @ezyang @VitalyFedyunin @ngimel @mruberry

zhangguanheng66 commented 4 years ago

Have you tried running many iterations and averaging the results for benchmarking purposes? BTW, it would be great if you could provide more code and details so we can reproduce the results.
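
The averaging approach suggested here can be sketched roughly as below. This is a generic timing harness, not the reporter's actual test code; for a CUDA model you would also call `torch.cuda.synchronize()` before reading the clock, since GPU kernels launch asynchronously.

```python
import time
import statistics

def benchmark(fn, warmup=10, iters=100):
    """Time fn over many iterations and report mean and stdev in seconds."""
    # Warm-up runs exclude one-time costs (cuDNN autotuning, allocator caching).
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        # For GPU workloads, call torch.cuda.synchronize() here before
        # stopping the clock, otherwise you only measure kernel launch time.
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

mean, stdev = benchmark(lambda: sum(range(1000)))
print(f"mean={mean:.6f}s stdev={stdev:.6f}s")
```

Reporting the spread alongside the mean makes it easier to tell whether a 0.0008s difference between two builds is real or within run-to-run noise.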

PoonKinWang commented 4 years ago

> Have you tried running many iterations and averaging the results for benchmarking purposes? BTW, it would be great if you could provide more code and details so we can reproduce the results.

Thanks for the reply. The test demo is as follows:

demo.zip

ngimel commented 4 years ago

There are too many moving parts here. To begin, can you please compile the 1.3.1 source tree from source with the same cuDNN version as the pre-compiled binaries (7.6.4) and compare the result with the precompiled binaries? That would verify that your build environment is OK. Then we can see whether there are regressions introduced by the different cuDNN version or by PyTorch 1.4.
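
When comparing two builds like this, the first step is to confirm exactly which versions each interpreter actually loads. A minimal sketch (the helper name is mine; it falls back to `None` values when torch is not installed):

```python
import importlib.util

def collect_build_info():
    """Return the version info relevant for comparing two PyTorch builds."""
    info = {"torch": None, "cuda": None, "cudnn": None}
    if importlib.util.find_spec("torch") is not None:
        import torch
        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda               # CUDA toolkit the build was compiled against
        info["cudnn"] = torch.backends.cudnn.version()  # e.g. 7605 for cuDNN 7.6.5
    return info

if __name__ == "__main__":
    print(collect_build_info())
```

Running this under both environments pins down whether the comparison is really 1.4.0/10.2/7.6.5 versus 1.3.1/10.0/7.6.4, before attributing the difference to the build itself.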

xsacha commented 4 years ago

Other than the obvious differences like your compiler and the actual code version, I'd say the third-party libraries used may differ. One big one is CUDA 10.2, which the prebuilt binaries do not use.
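
One concrete way library-version differences show up is cuDNN's convolution algorithm selection: with autotuning enabled, each cuDNN version may pick different kernels for the same model. A hedged sketch (the helper name is mine, and it is a no-op when torch is unavailable):

```python
import importlib.util

def enable_cudnn_autotune():
    """Enable cuDNN benchmark mode so cuDNN autotunes convolution algorithms.

    Returns the flag's new value, or None if torch is not installed.
    (Illustrative helper, not from the thread.)
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    torch.backends.cudnn.benchmark = True
    return torch.backends.cudnn.benchmark
```

Comparing both builds with this flag set the same way removes one source of variation when chasing a sub-millisecond gap.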