gpgpu-sim / gpgpu-sim_distribution

GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism, as well as a performance visualization tool, AerialVision, and an integrated energy model, GPUWattch.

libcudnn seems not to call cudaLaunchKernel in GPGPU-Sim. #113

Open RedCarrottt opened 5 years ago

RedCarrottt commented 5 years ago

I've tried to run cudnn_samples_v7 with GPGPU-Sim, but its cuDNN kernels do not run on GPGPU-Sim. It produces the following messages when "g_debug_execution = 3":

GPGPU-Sim PTX: CUDA API function "cudaError_t cudaMemcpy(void*, const void*, size_t, cudaMemcpyKind)" has been called.
GPGPU-Sim PTX: cudaMemcpy(): devPtr = 0xc01a5300
GPGPU-Sim API: Stream Manager State
GPGPU-Sim API:    stream 0 has 1 operations
GPGPU-Sim API:       0 :  stream operation memcpy host-to-device
GPGPU-Sim: ** START simulation thread (detected work) **
GPGPU-Sim API: Stream Manager State
GPGPU-Sim API:    stream 0 has 1 operations
GPGPU-Sim API:       0 :  stream operation memcpy host-to-device
GPGPU-Sim API: stream 0 performing memcpy host-to-device
GPGPU-Sim PTX: copying 3136 bytes from CPU[0x7fffdc43aa00] to GPU[0xc01a5300] ...  done.
GPGPU-Sim: ** STOP simulation thread (no work) **
GPGPU-Sim: *** simulation thread starting and spinning waiting for work ***
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.020256 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.029696 time requiring 3464 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.037888 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.070240 time requiring 207360 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.072352 time requiring 2057744 memory

GPGPU-Sim PTX: CUDA API function "cudaError_t cudaMalloc(void**, size_t)" has been called.
GPGPU-Sim PTX: allocating 46080 bytes on GPU starting at address 0xc01a6000
GPGPU-Sim PTX: cudaMallocing 46080 bytes starting at 0xc01a6000..

The cudaLaunchKernel function should be called after "Testing cudnnFindConvolutionForwardAlgorithm", but it is never called.

On the other hand, if I run a plain CUDA sample (such as vectorAdd), it works well.

I suspect that my cuDNN library does not call the cudaLaunchKernel function in GPGPU-Sim's 'libcudart.so'; it seems to call cudaLaunchKernel in the original 'libcudart.so' instead.

RedCarrottt commented 5 years ago

As @bigwater advised, I used https://github.com/gpgpu-sim/gpgpu-sim_simulations. Before building mnistCUDNN in that repository, I placed the original libcudart.so and libcudart_static.a in /usr/local/cuda/lib64 (without the original libcudart_static.a, the build fails). When I run it, libcudnn successfully calls cudaLaunchKernel.

However, I then ran into a deadlock and the simulator terminated. The full log file is attached:

mnistCUDNN.log

Even though mnistCUDNN now successfully calls the cudaLaunchKernel() function, PyTorch still fails to call it.