gawain102000 closed this issue 5 years ago.
Hope you can take a look! The testing log is attached as well.
Bo
Thanks for the bug report!
These are quite probably bugs in cuDNN. I don't believe we have tested with cuDNN 7.4.2 ourselves yet.
@timshen91 wdyt?
I'll take a look. If it's indeed a cuDNN 7.4.2 regression, I'll let the Nvidia folks know.
Thanks for your quick response! Please see whether you can reproduce this issue, and if you need any more information, please let me know here.
Thanks
Hi everyone, it fails intermittently on cuDNN 7.4.1.
Thanks
@gawain102000, I'm unable to reproduce it. See the full log below.
Which host compiler did you use?
~/src/nvidia_libs_test % bazel run --define libunwind=true --action_env=CC=/usr/bin/gcc-6 --action_env=CUDA_PATH="$HOME/sandbox/cuda" :cudnn_test -- --gtest_filter="Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME"
INFO: Invocation ID: 8da274cc-63d5-49f3-b9c3-2a9718323c70
INFO: Analysed target //:cudnn_test (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //:cudnn_test up-to-date:
bazel-bin/cudnn_test
INFO: Elapsed time: 0.202s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: external/bazel_tools/tools/test/test-setup.sh ./cudnn_test '--gtest_filter=Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_INFO: Build completed successfully, 1 total action
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //:cudnn_test
-----------------------------------------------------------------------------
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1212 23:10:47.320444 37724 cudnn_util.cc:68] Running cuDNN v7.4.1 for CUDA 9.0.0 on TITAN V
Note: Google Test filter = Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Conv3d/ConvolutionTest
[ RUN ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
[ OK ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME (20 ms)
[----------] 1 test from Conv3d/ConvolutionTest (20 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (20 ms total)
[ PASSED ] 1 test.
OK, I did reproduce it. It's non-deterministic.
~/src/nvidia_libs_test % bazel run --define libunwind=true --action_env=CC=/usr/bin/gcc-6 --action_env=CUDA_PATH="$HOME/sandbox/cuda" :cudnn_test -- --gtest_filter="Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME" --gtest_repeat=100 2>&1 | grep FAILED
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
1 FAILED TEST
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
1 FAILED TEST
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
1 FAILED TEST
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
1 FAILED TEST
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
1 FAILED TEST
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
1 FAILED TEST
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
1 FAILED TEST
Hi @timshen91
Cool! Thanks for reproducing this! I'm working on Ubuntu 16.04 with the default gcc:
gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Thanks Bo
Hi @gawain102000 ,
I noticed something. For this backward filter, for each convolution call, nvidia_libs_test by default fills the result buffer with NaNs. The code is in TEST_P(ConvolutionTest, CompareResults), and the call is to FillWithNaNs(). If I remove that line, all failures disappear.
I suspect that cuDNN is error-prone when the result buffer has garbage NaNs before the convolution happens.
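For anyone who wants to poke at this outside the Bazel harness, here is a rough, self-contained sketch of the pattern: poison the dw buffer with NaNs via cudaMemset(0xFF), run cudnnConvolutionBackwardFilter with beta = 0, and check whether any NaNs survive in the output. This is not the actual nvidia_libs_test code; the file name, dimensions, and helper names are made up, and it uses a small 2-D half convolution for brevity instead of the 3-D case from the report.

// nan_poison_repro.cu -- minimal sketch, NOT the actual nvidia_libs_test code.
// Build (assuming CUDA + cuDNN are installed):  nvcc nan_poison_repro.cu -lcudnn -o repro
#include <cudnn.h>
#include <cuda_fp16.h>
#include <cstdint>
#include <cstdio>
#include <vector>

#define CHECK_CUDA(x) do { cudaError_t e = (x); if (e != cudaSuccess) { \
  printf("CUDA error: %s (line %d)\n", cudaGetErrorString(e), __LINE__); return 1; } } while (0)
#define CHECK_CUDNN(x) do { cudnnStatus_t s = (x); if (s != CUDNN_STATUS_SUCCESS) { \
  printf("cuDNN error: %s (line %d)\n", cudnnGetErrorString(s), __LINE__); return 1; } } while (0)

// Fills a device buffer of halfs with a constant.
__global__ void FillHalf(__half* buf, int n, float value) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = __float2half(value);
}

int main() {
  // Small 2-D half convolution for brevity; the failing case in this issue is 3-D.
  const int N = 2, C = 4, H = 16, W = 16, K = 8, R = 3, S = 3;
  cudnnHandle_t handle;
  CHECK_CUDNN(cudnnCreate(&handle));

  cudnnTensorDescriptor_t x_desc, dy_desc;
  cudnnFilterDescriptor_t dw_desc;
  cudnnConvolutionDescriptor_t conv_desc;
  CHECK_CUDNN(cudnnCreateTensorDescriptor(&x_desc));
  CHECK_CUDNN(cudnnCreateTensorDescriptor(&dy_desc));
  CHECK_CUDNN(cudnnCreateFilterDescriptor(&dw_desc));
  CHECK_CUDNN(cudnnCreateConvolutionDescriptor(&conv_desc));
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(x_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF, N, C, H, W));
  CHECK_CUDNN(cudnnSetFilter4dDescriptor(dw_desc, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW, K, C, R, S));
  CHECK_CUDNN(cudnnSetConvolution2dDescriptor(conv_desc, 1, 1, 1, 1, 1, 1,
                                              CUDNN_CROSS_CORRELATION, CUDNN_DATA_HALF));
  int on, oc, oh, ow;
  CHECK_CUDNN(cudnnGetConvolution2dForwardOutputDim(conv_desc, x_desc, dw_desc, &on, &oc, &oh, &ow));
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(dy_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF, on, oc, oh, ow));

  const int x_elems = N * C * H * W, dy_elems = on * oc * oh * ow, dw_elems = K * C * R * S;
  __half *d_x, *d_dy, *d_dw;
  CHECK_CUDA(cudaMalloc((void**)&d_x, x_elems * sizeof(__half)));
  CHECK_CUDA(cudaMalloc((void**)&d_dy, dy_elems * sizeof(__half)));
  CHECK_CUDA(cudaMalloc((void**)&d_dw, dw_elems * sizeof(__half)));
  FillHalf<<<(x_elems + 255) / 256, 256>>>(d_x, x_elems, 0.5f);
  FillHalf<<<(dy_elems + 255) / 256, 256>>>(d_dy, dy_elems, 0.25f);

  cudnnConvolutionBwdFilterAlgo_t algo = CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1;
  size_t ws_size = 0;
  CHECK_CUDNN(cudnnGetConvolutionBackwardFilterWorkspaceSize(
      handle, x_desc, dy_desc, conv_desc, dw_desc, algo, &ws_size));
  void* d_ws = nullptr;
  if (ws_size > 0) CHECK_CUDA(cudaMalloc(&d_ws, ws_size));

  const float alpha = 1.0f, beta = 0.0f;  // beta == 0: prior contents of dw must be ignored.
  for (int iter = 0; iter < 100; ++iter) {
    // Poison the result buffer before the call: 0xFFFF is a half NaN bit pattern.
    CHECK_CUDA(cudaMemset(d_dw, 0xFF, dw_elems * sizeof(__half)));
    CHECK_CUDNN(cudnnConvolutionBackwardFilter(handle, &alpha, x_desc, d_x, dy_desc, d_dy,
                                               conv_desc, algo, d_ws, ws_size,
                                               &beta, dw_desc, d_dw));
    // Copy dw back as raw bits and count half NaNs (exponent all ones, mantissa nonzero).
    std::vector<uint16_t> bits(dw_elems);
    CHECK_CUDA(cudaMemcpy(bits.data(), d_dw, dw_elems * sizeof(__half), cudaMemcpyDeviceToHost));
    int nans = 0;
    for (uint16_t b : bits)
      if ((b & 0x7C00) == 0x7C00 && (b & 0x03FF) != 0) ++nans;
    if (nans > 0) printf("iter %d: %d NaN(s) leaked into dw\n", iter, nans);
  }
  printf("done\n");
  return 0;
}

With NaN-free inputs and beta = 0, the output should never contain NaNs regardless of what was in dw beforehand, so any hit from this loop would point at the same class of problem the test catches.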
Hi @timshen91
Thanks for investigating this! For the error log "[2788]: 0.22229 vs nan, error = nan", the correct value should be [2788]: 0.22229, and the reported value is nan. I can think of several factors that could make this issue non-deterministic:
(1) the output at [2788] was never updated, or (2) the output at [2788] was updated with the correct value and then overwritten with NaN, or (3) cuDNN produced NaN at [2788], which is incorrect.
BTW, both nvidia_libs_test and cuDNN make some asynchronous CUDA calls, and I cannot be sure they behave as expected. This is only a guess.
Hope the above info is helpful!
Thanks Bo
@gawain102000 can you try to reproduce it locally, directly with cuDNN, with the filter buffer filled with NaNs before calling the conv?
Hi @timshen91 Yes, on the cuDNN side we can see which API nvidia_libs_test calls, and when running tests against only that API I still cannot reproduce this issue. I will double-check.
Thanks
Wait, did you actually try to do that with the result buffer filled with NaNs?
There is an option to fill with NaNs, and I need to double-check whether it really works.
Thanks
BTW, I cannot be sure how device memory is requested or managed by nvidia_libs_test. There are two possibilities, sketched below.
The first way is to (1) request one block of memory with the total size of input plus output, with ptrT pointing to its beginning, and then (2) set ptrI (the input pointer) to ptrT and ptrO (the output pointer) to ptrT + sizeof(input).
The second way is to (1) request the full input memory and point ptrI at it, and (2) request the full output memory and point ptrO at it.
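For reference, a rough sketch of the two layouts, reusing the ptrT/ptrI/ptrO names from above; the function names are hypothetical and this is not the actual nvidia_libs_test allocator (error checking omitted):

#include <cuda_runtime.h>

// Layout (1): one pooled allocation; the output pointer is offset past the input.
// Note: ptrO is only as aligned as input_bytes allows.
void AllocatePooled(size_t input_bytes, size_t output_bytes, void** ptrI, void** ptrO) {
  char* ptrT = nullptr;
  cudaMalloc((void**)&ptrT, input_bytes + output_bytes);
  *ptrI = ptrT;                // input at the start of the block
  *ptrO = ptrT + input_bytes;  // output directly after the input
}

// Layout (2): two independent allocations; each pointer starts at the
// beginning of its own cudaMalloc block.
void AllocateSeparate(size_t input_bytes, size_t output_bytes, void** ptrI, void** ptrO) {
  cudaMalloc(ptrI, input_bytes);
  cudaMalloc(ptrO, output_bytes);
}

The separate-allocation style guarantees that each pointer gets cudaMalloc's default alignment, whereas the pooled style depends on input_bytes being suitably aligned.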
Thanks Bo
Hi @timshen91
We use the following flow to run the test against cuDNN:
(1) When beta is zero, always fill hfilterOutput with NaNs on the host and then copy it to the device:
memset(hfilterOutput, 0xFF, filterOutputDesc.totalSize * sizeof(half));
cudaMemcpy(dfilterOutput, hfilterOutput, filterOutputDesc.totalSize * sizeof(half), cudaMemcpyHostToDevice);
(2) Use cudaMalloc for the input:
cudaMalloc((void**)&devPtrI, InputTest.totalSize * sizeof(half));
(3) Use cudaMalloc for the input diff:
cudaMalloc((void**)&devPtrIdiff, InputDiffTest.totalSize * sizeof(half));
So each buffer gets its own cudaMalloc, and the pointer we use points to the beginning of that allocation, which falls on 32-, 64-, 128- or 512-byte segments of device memory aligned to their size.
Currently I still see no failure on my side. Hope the above is helpful!
Thanks Bo
All tensors are allocated individually in nvidia_libs_test (you should see this in a CUDA API trace). We rely on the default alignment being sufficient.
As Tim pointed out though, the failure isn't consistent (7 out of 100?), which would suggest this is a timing issue. Have you tried repeating your direct testing a number of times?
Hi @chsigg
Thanks for your response and your team's help on this issue! Yes, we are running the direct test many times locally.
Thanks
@nluehr is investigating this issue now.
Thanks for everybody's help! It looks like this is a cuDNN issue rather than a framework issue. Since a cuDNN engineer is investigating it internally, please allow me to close this here. If you have any questions, please let me know.
Thanks
bazel run //:cudnn_test --action_env=CUDNN_PATH=cuda9.0_cudnn_v7.4.1/cuda --action_env=CUDA_PATH=cuda -- --gtest_filter="Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME"
DEBUG: /home/lab/.vulcan/install/cuda/_tests/google_cudnn_test/codes/nvidia_libs_test-master/cuda_configure.bzl:12:5: Using CUDA from /home/lab/.vulcan/install/cuda
DEBUG: /home/lab/.vulcan/install/cuda/_tests/google_cudnn_test/codes/nvidia_libs_test-master/cuda_configure.bzl:13:5: Using cuDNN from /home/lab/bow/project/5_software/cuda9.0_cudnn_v7.4.1/cuda
INFO: Analysed target //:cudnn_test (0 packages loaded).
INFO: Found 1 target...
Target //:cudnn_test up-to-date:
bazel-bin/cudnn_test
INFO: Elapsed time: 0.112s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: external/bazel_tools/tools/test/test-setup.sh ./cudnn_test '--gtest_filter=Conv3d/ConvolutionTest.CompareRINFO: Build completed successfully, 1 total action
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //:cudnn_test
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1212 02:57:55.774238 18306 cudnn_util.cc:68] Running cuDNN v7.4.1 for CUDA 9.0.0 on Tesla V100-DGXS-16GB
Note: Google Test filter = Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Conv3d/ConvolutionTest
[ RUN ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
cudnn_conv_test.cc:462: Failure
Value of: IsOk(TensorDataEqual(ref_result_data, result_data, *result_desc, tolerance))
Actual: false (6 elements differ more than 10. Largest differences:
[2788]: 0.22229 vs nan, error = nan
[5904]: 0 vs nan, error = nan
[1744]: 0 vs nan, error = nan
[2784]: 0 vs nan, error = nan
[1748]: 0.221191 vs nan, error = nan
[5908]: 0.220581 vs nan, error = nan)
Expected: true
format: TENSOR_NCHW data_type: DATA_HALF compute_mode: DATA_HALF math_type: DEFAULT_MATH algo: CONVOLUTION_BWD_FILTER_ALGO_1
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
reference { input { dimension: 82 dimension: 4 dimension: 79 dimension: 9 dimension: 2 data_type: DATA_DOUBLE format: TENSOR_NCHW } filter { dimension: 12 dimension: 4 dimension: 2 dimension: 13 dimension: 5 data_type: DATA_DOUBLE format: TENSOR_NCHW } convolution { pad: 1 pad: 6 pad: 2 compute_mode: DATA_DOUBLE } one_minus_alpha: 0.99996569585949024 bwd_filter_algo: CONVOLUTION_BWD_FILTER_ALGO_0 label: "CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME" }
test { input { data_type: DATA_HALF format: TENSOR_NCHW } filter { data_type: DATA_HALF format: TENSOR_NCHW } convolution { compute_mode: DATA_HALF math_type: DEFAULT_MATH } all_algos: CONVOLUTION_BWD_FILTER } (21 ms)
[----------] 1 test from Conv3d/ConvolutionTest (21 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (21 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() =
reference { input { dimension: 82 dimension: 4 dimension: 79 dimension: 9 dimension: 2 data_type: DATA_DOUBLE format: TENSOR_NCHW } filter { dimension: 12 dimension: 4 dimension: 2 dimension: 13 dimension: 5 data_type: DATA_DOUBLE format: TENSOR_NCHW } convolution { pad: 1 pad: 6 pad: 2 compute_mode: DATA_DOUBLE } one_minus_alpha: 0.99996569585949024 bwd_filter_algo: CONVOLUTION_BWD_FILTER_ALGO_0 label: "CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME" }
test { input { data_type: DATA_HALF format: TENSOR_NCHW } filter { data_type: DATA_HALF format: TENSOR_NCHW } convolution { compute_mode: DATA_HALF math_type: DEFAULT_MATH } all_algos: CONVOLUTION_BWD_FILTER }
1 FAILED TEST