google / nvidia_libs_test

Tests and benchmarks for cudnn (and in the future, other nvidia libraries)
Apache License 2.0

an intermittent error, failed on V100 #3

Closed gawain102000 closed 5 years ago

gawain102000 commented 5 years ago

bazel run //:cudnn_test --action_env=CUDNN_PATH=cuda9.0_cudnn_v7.4.1/cuda --action_env=CUDA_PATH=cuda -- --gtest_filter="Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME"
DEBUG: /home/lab/.vulcan/install/cuda/_tests/google_cudnn_test/codes/nvidia_libs_test-master/cuda_configure.bzl:12:5: Using CUDA from /home/lab/.vulcan/install/cuda
DEBUG: /home/lab/.vulcan/install/cuda/_tests/google_cudnn_test/codes/nvidia_libs_test-master/cuda_configure.bzl:13:5: Using cuDNN from /home/lab/bow/project/5_software/cuda9.0_cudnn_v7.4.1/cuda
INFO: Analysed target //:cudnn_test (0 packages loaded).
INFO: Found 1 target...
Target //:cudnn_test up-to-date:
  bazel-bin/cudnn_test
INFO: Elapsed time: 0.112s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: external/bazel_tools/tools/test/test-setup.sh ./cudnn_test '--gtest_filter=Conv3d/ConvolutionTest.CompareRINFO: Build completed successfully, 1 total action
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //:cudnn_test
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1212 02:57:55.774238 18306 cudnn_util.cc:68] Running cuDNN v7.4.1 for CUDA 9.0.0 on Tesla V100-DGXS-16GB
Note: Google Test filter = Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Conv3d/ConvolutionTest
[ RUN      ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
cudnn_conv_test.cc:462: Failure
Value of: IsOk(TensorDataEqual(ref_result_data, result_data, *result_desc, tolerance))
  Actual: false (6 elements differ more than 10. Largest differences:
[2788]: 0.22229 vs nan, error = nan
[5904]: 0 vs nan, error = nan
[1744]: 0 vs nan, error = nan
[2784]: 0 vs nan, error = nan
[1748]: 0.221191 vs nan, error = nan
[5908]: 0.220581 vs nan, error = nan)
Expected: true
format: TENSOR_NCHW
data_type: DATA_HALF
compute_mode: DATA_HALF
math_type: DEFAULT_MATH
algo: CONVOLUTION_BWD_FILTER_ALGO_1
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = reference { input { dimension: 82 dimension: 4 dimension: 79 dimension: 9 dimension: 2 data_type: DATA_DOUBLE format: TENSOR_NCHW } filter { dimension: 12 dimension: 4 dimension: 2 dimension: 13 dimension: 5 data_type: DATA_DOUBLE format: TENSOR_NCHW } convolution { pad: 1 pad: 6 pad: 2 compute_mode: DATA_DOUBLE } one_minus_alpha: 0.99996569585949024 bwd_filter_algo: CONVOLUTION_BWD_FILTER_ALGO_0 label: "CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME" } test { input { data_type: DATA_HALF format: TENSOR_NCHW } filter { data_type: DATA_HALF format: TENSOR_NCHW } convolution { compute_mode: DATA_HALF math_type: DEFAULT_MATH } all_algos: CONVOLUTION_BWD_FILTER } (21 ms)
[----------] 1 test from Conv3d/ConvolutionTest (21 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (21 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = reference { input { dimension: 82 dimension: 4 dimension: 79 dimension: 9 dimension: 2 data_type: DATA_DOUBLE format: TENSOR_NCHW } filter { dimension: 12 dimension: 4 dimension: 2 dimension: 13 dimension: 5 data_type: DATA_DOUBLE format: TENSOR_NCHW } convolution { pad: 1 pad: 6 pad: 2 compute_mode: DATA_DOUBLE } one_minus_alpha: 0.99996569585949024 bwd_filter_algo: CONVOLUTION_BWD_FILTER_ALGO_0 label: "CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME" } test { input { data_type: DATA_HALF format: TENSOR_NCHW } filter { data_type: DATA_HALF format: TENSOR_NCHW } convolution { compute_mode: DATA_HALF math_type: DEFAULT_MATH } all_algos: CONVOLUTION_BWD_FILTER }

1 FAILED TEST

gawain102000 commented 5 years ago

Hope you can take a look! The test log is attached as well.

Bo

jlebar commented 5 years ago

Thanks for the bug report!

These are quite probably bugs in cudnn. I don't believe we have tested with cudnn 7.4.2 yet ourselves.

@timshen91 wdyt?

timshen91 commented 5 years ago

I'll take a look. If it's indeed a cuDNN 7.4.2 regression, I'll let the Nvidia folks know.

gawain102000 commented 5 years ago

Thanks for your quick response! Please see whether you can reproduce this issue, and if you need any more information, please let me know here.

Thanks

gawain102000 commented 5 years ago

Hi everyone, this fails intermittently on cuDNN 7.4.1.

Thanks

timshen91 commented 5 years ago

@gawain102000, I'm unable to reproduce it. See the full log below.

Which host compiler did you use?

~/src/nvidia_libs_test % bazel run --define libunwind=true --action_env=CC=/usr/bin/gcc-6 --action_env=CUDA_PATH="$HOME/sandbox/cuda" :cudnn_test -- --gtest_filter="Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME"
INFO: Invocation ID: 8da274cc-63d5-49f3-b9c3-2a9718323c70
INFO: Analysed target //:cudnn_test (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //:cudnn_test up-to-date:
  bazel-bin/cudnn_test
INFO: Elapsed time: 0.202s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: external/bazel_tools/tools/test/test-setup.sh ./cudnn_test '--gtest_filter=Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_INFO: Build completed successfully, 1 total action
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //:cudnn_test
-----------------------------------------------------------------------------
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1212 23:10:47.320444 37724 cudnn_util.cc:68] Running cuDNN v7.4.1 for CUDA 9.0.0 on TITAN V
Note: Google Test filter = Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Conv3d/ConvolutionTest
[ RUN      ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME
[       OK ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME (20 ms)
[----------] 1 test from Conv3d/ConvolutionTest (20 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (20 ms total)
[  PASSED  ] 1 test.
timshen91 commented 5 years ago

OK, I did reproduce it. It's non-deterministic.

~/src/nvidia_libs_test % bazel run --define libunwind=true --action_env=CC=/usr/bin/gcc-6 --action_env=CUDA_PATH="$HOME/sandbox/cuda" :cudnn_test -- --gtest_filter="Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME" --gtest_repeat=100 2>&1 | grep FAILED
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Conv3d/ConvolutionTest.CompareResults/CONVOLUTION_BWD_FILTER_NCHW_TRUE_HALF_82x4x79x9x2_12x4x2x13x5_SAME, where GetParam() = 
 1 FAILED TEST
gawain102000 commented 5 years ago

Hi @timshen91
Cool! Thanks for reproducing this! I was working on Ubuntu 16.04 with the default gcc:

" gcc --version gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609 "

Thanks Bo

timshen91 commented 5 years ago

Hi @gawain102000 ,

I noticed something. For this backward-filter test, nvidia_libs_test by default fills the result buffer with NaNs before each convolution call.

The code is in TEST_P(ConvolutionTest, CompareResults), and the call is to FillWithNaNs(). If I remove that line, all failures disappear.

I suspect that cuDNN is error-prone when the result buffer has garbage NaNs before the convolution happens.
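For illustration, here is a minimal sketch of that scenario when driven directly against the cuDNN API. This is not the repository's test code; descriptor and buffer setup is assumed to happen elsewhere, and all names are hypothetical.

#include <cuda_runtime.h>
#include <cudnn.h>

// Poison the filter-gradient buffer with NaNs, then run the backward-filter
// convolution with beta == 0. With beta == 0, cuDNN is supposed to overwrite
// the output completely, so the pre-existing NaNs must not leak into the result.
void BwdFilterWithPoisonedOutput(cudnnHandle_t handle,
                                 cudnnTensorDescriptor_t x_desc, const void* x,
                                 cudnnTensorDescriptor_t dy_desc, const void* dy,
                                 cudnnConvolutionDescriptor_t conv_desc,
                                 cudnnFilterDescriptor_t dw_desc, void* dw,
                                 size_t dw_bytes,
                                 void* workspace, size_t workspace_bytes) {
  // Every byte set to 0xFF encodes a NaN for both half and float elements,
  // mirroring what FillWithNaNs() does to the result buffer.
  cudaMemset(dw, 0xFF, dw_bytes);

  const float alpha = 1.0f, beta = 0.0f;  // float scaling factors for HALF data
  cudnnConvolutionBackwardFilter(handle, &alpha, x_desc, x, dy_desc, dy,
                                 conv_desc, CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1,
                                 workspace, workspace_bytes, &beta, dw_desc, dw);
}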

gawain102000 commented 5 years ago

Hi @timshen91

Thanks for investigating this! For the error-log entry "[2788]: 0.22229 vs nan, error = nan", the correct value at [2788] should be 0.22229, but the observed value is nan. I can think of several factors that could make this issue non-deterministic:

(1) the output at [2788] was never written, or (2) the output at [2788] was written with the correct value and then overwritten with nan, or (3) cuDNN computed nan at [2788], which is incorrect.

By the way, both nvidia_libs_test and cuDNN issue some asynchronous CUDA calls, and I cannot be sure they all behave as expected; this is only a guess.
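A cheap way to test the asynchrony guess (only a sketch; the helper name is hypothetical and this is not code from either project) is to force a synchronization point after the convolution and surface any deferred CUDA error:

#include <cstdio>
#include <cuda_runtime.h>

// Wait for all previously launched device work and report any deferred error.
// Returns true when everything completed successfully.
bool SyncAndCheck(const char* where) {
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    std::fprintf(stderr, "%s: %s\n", where, cudaGetErrorString(err));
    return false;
  }
  return true;
}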

Hope the above info is helpful!

Thanks Bo

timshen91 commented 5 years ago

@gawain102000 can you try to reproduce it locally directly with cuDNN, with the filter buffer filled with NaNs before callong the conv?

gawain102000 commented 5 years ago

Hi @timshen91, yes. On the cuDNN side we can see which API nvidia_libs_test called, and when running a test against only that API I still cannot reproduce this issue. I will double-check.

Thanks

timshen91 commented 5 years ago

Wait, did you actually try to do that with the result buffer filled with NaNs?

gawain102000 commented 5 years ago

There is an option to fill with NaNs, and I need to double-check whether it really works.

Thanks

gawain102000 commented 5 years ago

By the way, I cannot be sure how device memory is requested and managed by nvidia_libs_test. There are two possibilities, sketched in the code below.

The first way: (1) request one block of memory of the total size of input plus output, and let ptrT point to its beginning; then (2) let ptrI (the input pointer) point to ptrT and let ptrO (the output pointer) point to ptrT + sizeof(input).

The second way: (1) request memory for the whole input and let ptrI point to it; (2) request memory for the whole output and let ptrO point to it.
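A sketch of the two layouts described above (the names and element counts are only illustrative, not code from nvidia_libs_test):

#include <cuda_fp16.h>
#include <cuda_runtime.h>

// (1) One allocation that covers input + output; ptrO is an offset into the
//     same block, so its alignment depends on the size of the input.
void AllocateSharedBlock(size_t input_elems, size_t output_elems,
                         __half** ptrI, __half** ptrO) {
  __half* ptrT = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&ptrT),
             (input_elems + output_elems) * sizeof(__half));
  *ptrI = ptrT;
  *ptrO = ptrT + input_elems;
}

// (2) Separate allocations; each pointer gets cudaMalloc's own default alignment.
void AllocateSeparately(size_t input_elems, size_t output_elems,
                        __half** ptrI, __half** ptrO) {
  cudaMalloc(reinterpret_cast<void**>(ptrI), input_elems * sizeof(__half));
  cudaMalloc(reinterpret_cast<void**>(ptrO), output_elems * sizeof(__half));
}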

Thanks Bo

gawain102000 commented 5 years ago

Hi @timshen91

We use the following flow to run the test directly on cuDNN:

(1) When beta is zero, always fill hfilterOutput with NaNs on the host and then copy it to the device:

    memset(hfilterOutput, 0xFF, filterOutputDesc.totalSize * sizeof(half));
    cudaMemcpy(dfilterOutput, hfilterOutput, filterOutputDesc.totalSize * sizeof(half), cudaMemcpyHostToDevice);

(2) Use cudaMalloc for the input:

    cudaMalloc((void **)&devPtrI, InputTest.totalSize * sizeof(half));

(3) Use cudaMalloc for the input diff:

    cudaMalloc((void **)&devPtrIdiff, InputDiffTest.totalSize * sizeof(half));

In other words, each buffer gets its own cudaMalloc, and its pointer points to the beginning of a device-memory allocation, which lands on 32-, 64-, 128-, or 512-byte segments of device memory aligned to their size.

So far I still see no issue this way. Hope the above is helpful!
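To make the check after each direct run explicit, here is a small helper sketch (a hypothetical name, not code from either side) that copies the fp16 result back to the host and counts NaN elements by bit pattern:

#include <cstdint>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Copy a half-precision (fp16) result buffer back to the host and count NaN
// elements: exponent bits all ones (0x7C00) and mantissa bits (0x03FF) nonzero.
size_t CountHalfNans(const void* device_ptr, size_t num_elems) {
  std::vector<uint16_t> host(num_elems);
  cudaMemcpy(host.data(), device_ptr, num_elems * sizeof(uint16_t),
             cudaMemcpyDeviceToHost);
  size_t nans = 0;
  for (uint16_t bits : host) {
    if ((bits & 0x7C00) == 0x7C00 && (bits & 0x03FF) != 0) ++nans;
  }
  return nans;
}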

Thanks Bo

chsigg commented 5 years ago

All tensors are allocated individually in nvidia_libs_test (you should see this in a CUDA API trace). We rely on the default alignment being sufficient.

As Tim pointed out though, the failure isn't consistent (7 out of 100?), which would suggest this is a timing issue. Have you tried repeating your direct testing a number of times?

gawain102000 commented 5 years ago

Hi @chsigg

Thanks for your response and your team's help on this issue! Yes, we are running the direct test many times locally.

Thanks

mruberry commented 5 years ago

@nluehr is investigating this issue now.

gawain102000 commented 5 years ago

Thanks for everybody's help! It looks like this is a cuDNN issue rather than a framework issue. Since a cuDNN engineer is investigating this internally, please allow me to close the issue here. If you have any questions, please let me know.

Thanks