spcl / ucudnn

Accelerating DNN Convolutional Layers with Micro-batches
BSD 3-Clause "New" or "Revised" License

TF1.6 fails with a segmentation fault, dump trace provided #3

Closed. oscarriddle closed this issue 5 years ago.

oscarriddle commented 5 years ago

Sys info: CentOS 7, TF 1.6, Python 3.6, CUDA 9.0, cuDNN 7.0.5, gcc 4.9.3

Description:

1. I compiled ucudnn and ran its tests, and it seems to work well.
2. I then patched it into TF 1.6 at the specific commit and compiled TF 1.6 successfully.
3. I ran a Python script that normally builds a graph for a standard Inception v3 model; it segfaults right after printing:

```
---
Device: TITAN Xp
CUDA version: 9000
cuDNN version: 7005
u-cuDNN version: 1.1.0 (commit 0b4663f12f1205514933354bbc7a7571e1331247)
---

Thread 165 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffc86bfd700 (LWP 36259)]
0x00007fff6b1147bb in ?? () from /lib64/libcuda.so.1
```

4. I then used gdb to debug it; here is the backtrace:

```
(gdb) bt
#0  0x00007fff6b1147bb in ?? () from /lib64/libcuda.so.1
#1  0x00007fff6b036dcd in ?? () from /lib64/libcuda.so.1
#2  0x00007fff6b1c253b in cuEventRecord () from /lib64/libcuda.so.1
#3  0x00007fff5a77c57b in ?? () from /home/web_server/xiaolun/cuda-9.0-cudnn-7.0.5/lib64/libcudnn.so.7
#4  0x00007fff5a7c302b in ?? () from /home/web_server/xiaolun/cuda-9.0-cudnn-7.0.5/lib64/libcudnn.so.7
#5  0x00007fff59b16bda in cudnnSetStream () from /home/web_server/xiaolun/cuda-9.0-cudnn-7.0.5/lib64/libcudnn.so.7
#6  0x00007fff759f005b in bool perftools::gputools::cuda::CudnnSupport::DoConvolveImpl(perftools::gputools::Stream*, int, perftools::gputools::dnn::BatchDescriptor const&, perftools::gputools::DeviceMemory const&, perftools::gputools::dnn::FilterDescriptor const&, perftools::gputools::DeviceMemory const&, perftools::gputools::dnn::ConvolutionDescriptor const&, perftools::gputools::dnn::BatchDescriptor const&, perftools::gputools::DeviceMemory*, perftools::gputools::ScratchAllocator*, perftools::gputools::dnn::AlgorithmConfig const&, perftools::gputools::dnn::ProfileResult*) () from /home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#7  0x00007fff759f110c in perftools::gputools::cuda::CudnnSupport::DoConvolve(perftools::gputools::Stream*, perftools::gputools::dnn::BatchDescriptor const&, perftools::gputools::DeviceMemory const&, perftools::gputools::dnn::FilterDescriptor const&, perftools::gputools::DeviceMemory const&, perftools::gputools::dnn::ConvolutionDescriptor const&, perftools::gputools::dnn::BatchDescriptor const&, perftools::gputools::DeviceMemory*, perftools::gputools::ScratchAllocator*, perftools::gputools::dnn::AlgorithmConfig const&, perftools::gputools::dnn::ProfileResult*) () from /home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#8  0x00007fff7588f7d6 in perftools::gputools::Stream::ThenConvolveWithAlgorithm(perftools::gputools::dnn::BatchDescriptor const&, perftools::gputools::DeviceMemory const&, perftools::gputools::dnn::FilterDescriptor const&, perftools::gputools::DeviceMemory const&, perftools::gputools::dnn::ConvolutionDescriptor const&, perftools::gputools::dnn::BatchDescriptor const&, perftools::gputools::DeviceMemory*, perftools::gputools::ScratchAllocator*, perftools::gputools::dnn::AlgorithmConfig const&, perftools::gputools::dnn::ProfileResult*) () from /home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#9  0x00007fff7a215501 in tensorflow::LaunchConv2DOp::operator()(tensorflow::OpKernelContext*, bool, bool, tensorflow::Tensor const&, tensorflow::Tensor const&, int, int, int, int, tensorflow::Padding const&, tensorflow::Tensor*, tensorflow::TensorFormat) () from /home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x00007fff7a219a08 in tensorflow::Conv2DOp::Compute(tensorflow::OpKernelContext*) () from /home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007fff75910b49 in tensorflow::BaseGPUDevice::ComputeHelper(tensorflow::OpKernel*, tensorflow::OpKernelContext*) () from /home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#12 0x00007fff75910fc0 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) () from /home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#13 0x00007fff75945a06 in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) () from /home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#14 0x00007fff75945bdf in std::_Function_handler const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#15 0x00007fff7559f358 in Eigen::NonBlockingThreadPoolTempl::WorkerLoop(int) () from /home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#16 0x00007fff7559e092 in std::_Function_handler)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /home/web_server/dlpy72/py3.6/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#17 0x00007fff6c387760 in std::(anonymous namespace)::execute_native_thread_routine (__p=) at ../../../../../src_tree/gcc/libstdc++-v3/src/c++11/thread.cc:84
#18 0x00007ffff768ae25 in start_thread () from /lib64/libpthread.so.0
#19 0x00007ffff6caf34d in clone () from /lib64/libc.so.6
```

Problem: the backtrace implies the error occurs inside libcuda when Conv2D is executed. Could you give some hints on how to address this problem?

My test model is a standard Inception V3. The segmentation fault occurs when TF 1.6 executes Conv2D. Since there is no debug information from libcuda, I cannot determine the specific cause on my side. Thank you very much.
oscarriddle commented 5 years ago
```
---
Device: TITAN Xp
CUDA version: 9000
cuDNN version: 7005
u-cuDNN version: 1.1.0 (commit 0b4663f12f1205514933354bbc7a7571e1331247)
---

2019-09-06 17:04:48.845554: I tensorflow/stream_executor/stream.cc:697] Called Stream::ThenConvolveWithAlgorithm(input_descriptor=b16d3s299 299 , input_data=0x7f6659022000, filter_descriptor=od32id3s3 3 , filter_data=0x7f665a080b00, convolution_descriptor=p0:0_p1:0_s0:2_s1:2_d0:1_d1:1, output_descriptor=b16d32s149 149 , output=0x7f665a081900, algorithm_config=0, -1) stream=0xa9d79e0
2019-09-06 17:04:48.845585: I tensorflow/stream_executor/stream.cc:698] A Get in strema ThenConvolveWithAlgorithm()
2019-09-06 17:04:48.845630: I tensorflow/stream_executor/cuda/cuda_dnn.cc:2288] A Get in cuda_dnn ThenConvolveWithAlgorithm()
Environment variable UCUDNN_BATCH_SIZE_POLICY is not set. Using "powerOfTwo" instead.
Environment variable UCUDNN_BENCHMARK_DEVICES is not set. Using "" instead.
Environment variable UCUDNN_TOTAL_WORKSPACE_SIZE is not set. Using "0" instead.
Environment variable UCUDNN_DATABASE is not set. Using "" instead.
2019-09-06 17:04:48.845766: F tensorflow/stream_executor/cuda/cuda_dnn.cc:2294] failed to set stream for cudnn handle: CUDNN_STATUS_MAPPING_ERROR
Aborted (core dumped)
```

The error comes from:

```cpp
  auto status = wrap::cudnnSetStream(parent_, ToHandle(dnn_handle_),
                                     AsCUDAStreamValue(stream));
```

According to the cuDNN developer guide:

4.206. cudnnSetStream
```cpp
cudnnStatus_t cudnnSetStream(
    cudnnHandle_t   handle,
    cudaStream_t    streamId)
```
This function sets the user's CUDA stream in the cuDNN handle. The new stream will be used to launch cuDNN GPU kernels or to synchronize to this stream when cuDNN kernels are launched in the internal streams. If the cuDNN library stream is not set, all kernels use the default (NULL) stream. Setting the user stream in the cuDNN handle guarantees the issue-order execution of cuDNN calls and other GPU kernels launched in the same stream.

Parameters

handle
Input. Pointer to the cuDNN handle.

streamID
Input. New CUDA stream to be written to the cuDNN handle.

Returns

CUDNN_STATUS_BAD_PARAM
Invalid (NULL) handle.

CUDNN_STATUS_MAPPING_ERROR
Mismatch between the user stream and the cuDNN handle context.

CUDNN_STATUS_SUCCESS
The new stream was set successfully.
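
For reference, outside of TensorFlow's stream_executor layer the call is normally used as in the minimal sketch below (error handling abbreviated); it assumes the handle and the stream are created in the same CUDA context:

```cpp
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // Create a cuDNN handle; it is bound to the currently active device/context.
  cudnnHandle_t handle;
  cudnnCreate(&handle);

  // Create a user stream in the same context and attach it to the handle.
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudnnStatus_t status = cudnnSetStream(handle, stream);
  if (status != CUDNN_STATUS_SUCCESS) {
    // CUDNN_STATUS_MAPPING_ERROR here would mean the stream does not belong
    // to the same context as the handle.
    std::printf("cudnnSetStream failed: %s\n", cudnnGetErrorString(status));
  }

  cudaStreamDestroy(stream);
  cudnnDestroy(handle);
  return 0;
}
```

In the failing case above, dnn_handle_ is whatever handle ucudnn hands back to TensorFlow, so presumably the mismatch is between that handle's context and the stream TensorFlow passes in.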

Perhaps a wrapped API for cudnnSetStream is needed, or perhaps this setting can simply be bypassed.
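
To illustrate what such a wrapper could look like, here is a purely hypothetical sketch (UcudnnHandle and its underlying member are invented names, not ucudnn's actual interface); the idea is simply to forward the stream to the real cuDNN handle that ucudnn owns:

```cpp
#include <cudnn.h>
#include <cuda_runtime.h>

// Hypothetical stand-in for ucudnn's internal handle type, which would own
// the real cuDNN handle created on the framework's behalf.
struct UcudnnHandle {
  cudnnHandle_t underlying;
};

// Hypothetical wrapped cudnnSetStream: take the stream from the framework
// and forward it to the cuDNN handle actually used for kernel launches.
cudnnStatus_t ucudnnSetStream(UcudnnHandle* handle, cudaStream_t stream) {
  if (handle == nullptr) return CUDNN_STATUS_BAD_PARAM;
  return cudnnSetStream(handle->underlying, stream);
}
```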

oscarriddle commented 5 years ago

I tried to bypass cudnnSetStream by simply commenting out the code below:

```cpp
  //auto status = wrap::cudnnSetStream(parent_, ToHandle(dnn_handle_), AsCUDAStreamValue(stream));//.getCudnnHandle(),
  //if (status != CUDNN_STATUS_SUCCESS) {
  //  LOG(FATAL) << "failed to set stream for cudnn handle: " << ToString(status);
  //}
```

With that, the cuDNN handle stays on the default stream (stream 0). cuda_dnn.cc then proceeds to status = wrap::cudnnGetConvolutionForwardWorkspaceSize() around line 2400.

This call goes to getWorkspaceSize in ucudnnHandle.cpp, which calls getAlgorithm in ucudnnHandle.cpp, then the optimizer, and finally cudnnFindConvolutionGenericAlgorithm() in utils.cpp, which errors out with cudnnStatus_t 3, i.e. CUDNN_STATUS_BAD_PARAM.
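
For comparison, the plain cuDNN equivalent of that workspace query (no TensorFlow or ucudnn in the path) looks roughly like the sketch below. The shapes are taken from the ThenConvolveWithAlgorithm log line above (16x3x299x299 input, 32 filters of 3x3, pad 0, stride 2); the float data type and the IMPLICIT_GEMM algorithm are assumptions made for the example:

```cpp
#include <cudnn.h>
#include <cstdio>

int main() {
  cudnnHandle_t handle;
  cudnnCreate(&handle);

  // Shapes from the logged call: input 16x3x299x299, filter 32x3x3x3,
  // pad 0, stride 2 -> output 16x32x149x149.
  cudnnTensorDescriptor_t xDesc, yDesc;
  cudnnFilterDescriptor_t wDesc;
  cudnnConvolutionDescriptor_t convDesc;
  cudnnCreateTensorDescriptor(&xDesc);
  cudnnCreateTensorDescriptor(&yDesc);
  cudnnCreateFilterDescriptor(&wDesc);
  cudnnCreateConvolutionDescriptor(&convDesc);
  cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                             16, 3, 299, 299);
  cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                             32, 3, 3, 3);
  cudnnSetConvolution2dDescriptor(convDesc, 0, 0, 2, 2, 1, 1,
                                  CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
  cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                             16, 32, 149, 149);

  // The query that cuda_dnn.cc reaches after the bypass: the handle and every
  // descriptor must be valid here, otherwise CUDNN_STATUS_BAD_PARAM is returned.
  size_t workspaceBytes = 0;
  cudnnStatus_t status = cudnnGetConvolutionForwardWorkspaceSize(
      handle, xDesc, wDesc, convDesc, yDesc,
      CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM, &workspaceBytes);
  std::printf("status=%d workspace=%zu bytes\n", (int)status, workspaceBytes);

  cudnnDestroyConvolutionDescriptor(convDesc);
  cudnnDestroyFilterDescriptor(wDesc);
  cudnnDestroyTensorDescriptor(yDesc);
  cudnnDestroyTensorDescriptor(xDesc);
  cudnnDestroy(handle);
  return 0;
}
```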

In the cudnnFindConvolutionGenericAlgorithm() API, CUDNN_STATUS_BAD_PARAM indicates:

CUDNN_STATUS_BAD_PARAM
At least one of the following conditions are met:

handle is not allocated properly.
xDesc, wDesc or yDesc is not allocated properly.
xDesc, wDesc or yDesc has fewer than 1 dimension.
Either returnedCount or perfResults is nil.
requestedCount is less than 1.

So it fails either at cudnnSetStream or at getAlgorithm, which suggests there is some difference between how TensorFlow uses the cuDNN handle and the normal way of using cuDNN.
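
For reference, the standard cuDNN forward algorithm search (cudnnFindConvolutionForwardAlgorithm, presumably what cudnnFindConvolutionGenericAlgorithm() in utils.cpp dispatches to in the forward case) satisfies the BAD_PARAM conditions above when called like this; a sketch assuming handle, xDesc, wDesc, convDesc and yDesc are set up as in the earlier workspace-size sketch:

```cpp
// Assumes handle, xDesc, wDesc, convDesc and yDesc are valid, e.g. as set up
// in the workspace-size sketch above.
const int requestedCount = 8;   // must be >= 1
int returnedCount = 0;          // out-parameter, must not be null
cudnnConvolutionFwdAlgoPerf_t perfResults[requestedCount];

cudnnStatus_t status = cudnnFindConvolutionForwardAlgorithm(
    handle, xDesc, wDesc, convDesc, yDesc,
    requestedCount, &returnedCount, perfResults);
// CUDNN_STATUS_BAD_PARAM here would indicate one of the listed conditions:
// an invalid handle or descriptor, requestedCount < 1, or null out-pointers.
```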

Would you help address this problem, or give some advice?

oscarriddle commented 5 years ago

Solved by changing the way TensorFlow deals with the cuDNN handle. The ucudnn API can now be called successfully from TensorFlow.

tbennun commented 5 years ago

This is great. Would you like to upload a pull request?

oscarriddle commented 5 years ago

@tbennun please refer to #6