cudnnFindConvolutionForwardAlgorithmEx vs cudnnGetConvolutionForwardAlgorithm

cancan101 commented 7 years ago

Following up on https://github.com/tensorflow/tensorflow/issues/7187#issuecomment-290284053, why does TensorFlow use cudnnGetConvolutionForwardAlgorithm rather than cudnnFindConvolutionForwardAlgorithmEx? It looks like TensorFlow tries to do the more complete profiling itself.

For reference, cudnnGetConvolutionForwardAlgorithm serves as a heuristic for obtaining the best suited algorithm for cudnnConvolutionForward for the given layer specifications. Based on the input preference, this function will either return the fastest algorithm or the fastest algorithm within a given memory limit. For an exhaustive search for the fastest algorithm, please use cudnnFindConvolutionForwardAlgorithm.

Whereas cudnnFindConvolutionForwardAlgorithmEx attempts all available cuDNN algorithms for cudnnConvolutionForward, using user-allocated GPU memory, and outputs performance metrics to a user-allocated array of cudnnConvolutionFwdAlgoPerf_t. These metrics are written in sorted order, with the first element having the lowest compute time.
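To make the difference concrete, here is a minimal Python sketch of what a caller might do with a sorted list of measured results like the one described above. It does not call cuDNN: `AlgoPerf` is a hypothetical stand-in for cudnnConvolutionFwdAlgoPerf_t, the helper names are made up for illustration, and the timings and workspace sizes are invented numbers, not benchmark results.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class AlgoPerf:
    """Hypothetical stand-in for one cudnnConvolutionFwdAlgoPerf_t entry."""
    algo: str
    time_ms: float
    workspace_bytes: int


def sort_by_time(results: Iterable[AlgoPerf]) -> List[AlgoPerf]:
    """Order results so the first element has the lowest compute time."""
    return sorted(results, key=lambda r: r.time_ms)


def fastest_within_limit(
    results: Iterable[AlgoPerf],
    workspace_limit_bytes: Optional[int] = None,
) -> Optional[AlgoPerf]:
    """Return the fastest entry whose workspace fits the limit.

    With no limit this is simply the fastest entry overall; returns None
    if nothing fits.
    """
    for perf in sort_by_time(results):
        if workspace_limit_bytes is None or perf.workspace_bytes <= workspace_limit_bytes:
            return perf
    return None


# Invented measurements, for illustration only.
measured = [
    AlgoPerf("IMPLICIT_GEMM", time_ms=2.0, workspace_bytes=0),
    AlgoPerf("FFT", time_ms=1.2, workspace_bytes=64 << 20),
    AlgoPerf("WINOGRAD", time_ms=0.9, workspace_bytes=256 << 20),
]

if __name__ == "__main__":
    print(fastest_within_limit(measured))             # fastest overall
    print(fastest_within_limit(measured, 128 << 20))  # fastest under 128 MiB
```

The point of the sketch is only that an exhaustive search yields measured numbers to choose from, while the heuristic call has to answer the same "fastest, possibly within a memory limit" question without running anything.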

A number of other DNN frameworks seem to use cudnnFindConvolutionForwardAlgorithmEx / cudnnFindConvolutionForwardAlgorithm:

PyTorch (when benchmark is on)
Theano (if time_once or time_on_shape_change)
CNTK (non-static finder)

/CC @Yangqing @zheng-xq

asimshankar commented 7 years ago

@zheng-xq @vrv : Might one of you have some historical background on this choice, or general comments?

zheng-xq commented 7 years ago

cudnnGetConvolutionForwardAlgorithm is the fallback path. By default, TensorFlow does the autotuning itself; this was implemented before cudnnFindConvolutionForwardAlgorithmEx was available. The custom implementation also lets us filter out noise by measuring over multiple run steps. At this point, cudnnFindConvolutionForwardAlgorithmEx doesn't seem to offer enough additional functionality to justify a change.
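As an illustration of filtering out noise through multiple runs, here is a minimal Python sketch. It is not TensorFlow's autotuner: the `autotune` function, its parameters, and the choice of the median as the noise filter are all assumptions made for this example, and the candidates are plain Python callables rather than cuDNN algorithms.

```python
import statistics
import time
from typing import Callable, Dict, Tuple


def autotune(
    candidates: Dict[str, Callable[[], None]],
    runs: int = 5,
    timer: Callable[[], float] = time.perf_counter,
) -> Tuple[str, Dict[str, float]]:
    """Time each candidate `runs` times and pick the lowest median.

    Using the median (an assumption of this sketch) means one unusually
    slow run does not disqualify an otherwise fast candidate.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    medians: Dict[str, float] = {}
    for name, fn in candidates.items():
        samples = []
        for _ in range(runs):
            start = timer()
            fn()
            samples.append(timer() - start)
        medians[name] = statistics.median(samples)
    # The candidate with the smallest median time wins.
    best = min(medians, key=medians.get)
    return best, medians


if __name__ == "__main__":
    # Two toy workloads standing in for different convolution algorithms.
    best, medians = autotune({
        "small_sum": lambda: sum(range(1_000)),
        "large_sum": lambda: sum(range(100_000)),
    })
    print(best, medians)
```

A single timed run per candidate, by contrast, can be thrown off by one slow outlier, which is the kind of noise the multi-run approach is meant to smooth over.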

In the future, the plan is to autotune both cuDNN algorithms and other custom kernels together, so we can pick the fastest from both worlds.

asimshankar commented 7 years ago

@cancan101 : Does that answer your question? (Will wait a while before closing this out as intended behavior)

cancan101 commented 7 years ago

Yeah, it does make sense. As an aside, it might be nice to log the results of the profiling runs. I think pytorch / torch7 has an option to do this.

asimshankar commented 7 years ago

Thanks. Closing this out.

It might make sense for the selected algorithm to be logged either to the logging system or maybe in the RunMetadata protocol buffer. If you'd like to make a contribution towards that, we'll be glad to take a look!

tensorflow / tensorflow

cudnnFindConvolutionForwardAlgorithmEx vs cudnnGetConvolutionForwardAlgorithm #8928