Closed cancan101 closed 7 years ago
@zheng-xq @vrv : Might one of you have some historical background on this choice, or general comments?
cudnnGetConvolutionForwardAlgorithm is the fallback path. By default, TensorFlow does the autotuning itself; that code was written before cudnnFindConvolutionForwardAlgorithmEx was available. The custom implementation also lets us filter out noise by timing multiple run steps. At this point, cudnnFindConvolutionForwardAlgorithmEx doesn't seem to offer enough additional functionality to justify a change.
In the future, the plan is to autotune both cuDNN algorithms and other custom kernels together, so we can pick the fastest from both worlds.
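For readers who want a concrete picture of what "filter out the noise through multiple run steps" can look like, here is a minimal sketch. It is not TensorFlow's actual autotuner: the function name `autotune`, the `measure` callback, and the choice of the median as the noise filter are all assumptions made for illustration.

```python
import statistics
from typing import Callable, Iterable


def autotune(algorithms: Iterable[str],
             measure: Callable[[str], float],
             repeats: int = 5) -> str:
    """Return the algorithm with the lowest median time over `repeats` runs.

    `measure` is a hypothetical callback that runs one algorithm once and
    returns its elapsed time. Taking the median (an assumption, not
    necessarily what TensorFlow does) keeps a single slow outlier from
    deciding the result.
    """
    best_name, best_time = None, float("inf")
    for name in algorithms:
        samples = [measure(name) for _ in range(repeats)]
        typical = statistics.median(samples)
        if best_name is None or typical < best_time:
            best_name, best_time = name, typical
    if best_name is None:
        raise ValueError("no candidate algorithms")
    return best_name
```

With a single timed run, one noisy sample can make a fast algorithm look slow; with several runs the outlier is discarded and the faster algorithm wins.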
@cancan101 : Does that answer your question? (Will wait a while before closing this out as intended behavior)
Yeah, it does make sense. As an aside, it might be nice to log the results of the profiling runs. I think PyTorch / Torch7 has an option to do this.
Thanks. Closing this out.
It might make sense for the selected algorithm to be logged either to the logging system or maybe in the RunMetadata protocol buffer. If you'd like to make a contribution towards that, we'll be glad to take a look!
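As a rough illustration of the logging idea, the sketch below writes each candidate's timing and the final choice to Python's standard `logging` module. This is not an existing TensorFlow feature: the logger name, the function `log_autotune_results`, and the shape of the `timings_ms` dictionary are hypothetical.

```python
import logging
from typing import Dict

# Hypothetical logger name, chosen only for this example.
logger = logging.getLogger("conv_autotune")


def log_autotune_results(op_name: str, timings_ms: Dict[str, float]) -> str:
    """Log every profiled algorithm, fastest first, and return the winner."""
    if not timings_ms:
        raise ValueError("no profiling results to log")
    ranked = sorted(timings_ms.items(), key=lambda item: item[1])
    for algo, ms in ranked:
        # Per-candidate timings go to DEBUG so they can be switched off.
        logger.debug("%s: algorithm %s took %.3f ms", op_name, algo, ms)
    selected = ranked[0][0]
    logger.info("%s: selected algorithm %s", op_name, selected)
    return selected
```

Keeping the per-candidate lines at DEBUG and the selection at INFO is one possible split; recording the same data in RunMetadata would need changes on the TensorFlow side that are not sketched here.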
Following up on https://github.com/tensorflow/tensorflow/issues/7187#issuecomment-290284053, why does TensorFlow use cudnnGetConvolutionForwardAlgorithm rather than cudnnFindConvolutionForwardAlgorithmEx? It looks like TensorFlow tries to do the more complete profiling itself.
For reference, cudnnGetConvolutionForwardAlgorithm "serves as a heuristic for obtaining the best suited algorithm for cudnnConvolutionForward for the given layer specifications. Based on the input preference, this function will either return the fastest algorithm or the fastest algorithm within a given memory limit. For an exhaustive search for the fastest algorithm, please use cudnnFindConvolutionForwardAlgorithm."
Whereas the cudnnFindConvolutionForwardAlgorithmEx "function attempts all available cuDNN algorithms for cudnnConvolutionForward, using user-allocated GPU memory, and outputs performance metrics to a user-allocated array of cudnnConvolutionFwdAlgoPerf_t. These metrics are written in sorted fashion where the first element has the lowest compute time."
Looking at a number of other DNN frameworks, they seem to use cudnnFindConvolutionForwardAlgorithmEx / cudnnFindConvolutionForwardAlgorithm:
: time_once or time_on_shape_change)
/CC @Yangqing @zheng-xq
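To make the difference between the two quoted behaviours concrete, here is a small Python model of them. It does not call cuDNN: `AlgoPerf`, `find_exhaustive`, `get_heuristic`, and the `benchmark` callback are hypothetical names, and the static estimates used by the heuristic are an assumption about how such a heuristic could work.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class AlgoPerf:
    """Hypothetical stand-in for one cudnnConvolutionFwdAlgoPerf_t entry."""
    name: str
    time_ms: float
    workspace_bytes: int


def find_exhaustive(names: Iterable[str],
                    benchmark: Callable[[str], AlgoPerf]) -> List[AlgoPerf]:
    """'Find' style: run every candidate, return results sorted fastest first."""
    return sorted((benchmark(name) for name in names),
                  key=lambda perf: perf.time_ms)


def get_heuristic(estimates: Iterable[AlgoPerf],
                  memory_limit_bytes: Optional[int] = None) -> AlgoPerf:
    """'Get' style: choose from static estimates without running anything.

    With no limit, return the fastest estimate; with a limit, return the
    fastest estimate whose workspace fits within it.
    """
    eligible = [p for p in estimates
                if memory_limit_bytes is None
                or p.workspace_bytes <= memory_limit_bytes]
    if not eligible:
        raise ValueError("no algorithm fits within the memory limit")
    return min(eligible, key=lambda perf: perf.time_ms)
```

The exhaustive path pays the cost of running each candidate but ranks them on measured times, while the heuristic path is cheap and only as good as its estimates.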