Closed: el1995 closed this issue 5 years ago
Please provide details about the platform you are using (operating system, architecture). Also include your TensorFlow version and, if possible, a code snippet to reproduce the issue at hand. If you are unclear about what to include, see the template displayed when opening a new GitHub issue. Thanks!
Platform & OS: Ubuntu 16.04 LTS
TensorFlow version: no local installation, only the C API, version 1.13.1 (https://www.tensorflow.org/install/lang_c), CPU-only build
Code, neural net and executable: inference.zip
This question is better asked on StackOverflow since it is not a bug or feature request. There is also a larger community that reads questions there and can provide help. Thanks!
I'm currently working on a project that requires deep-learning inference via TensorFlow's C API, using a trained neural net in frozen-graph format. We use the inference for Computational Fluid Dynamics, which makes performance a key aspect for me: a single simulation comprises thousands of timesteps, and in each timestep the inference must be carried out for thousands of sets of input data. In my current case, the computational domain contains 33400 cells and 880 boundary patches, so for every one of these thousands of timesteps I have to run the inference 34280 times. We use 3 input and 15 output values.
The whole inference process (from providing the input values to receiving the output values) takes a total of 91 milliseconds on my GPU. The actual inference step, TF_SessionRun(...), accounts for 98% of that time:
TF_CAPI_EXPORT extern void TF_SessionRun(
    TF_Session* session, const TF_Buffer* run_options,
    const TF_Output* inputs, TF_Tensor* const* input_values, int ninputs,
    const TF_Output* outputs, TF_Tensor** output_values, int noutputs,
    const TF_Operation* const* target_opers, int ntargets,
    TF_Buffer* run_metadata, TF_Status*);
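For context, each of my per-sample calls currently looks roughly like the sketch below; the op names "input"/"output" and the [1, 3] -> [1, 15] tensor shapes are simplified placeholders for my actual graph:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <tensorflow/c/c_api.h>

/* Sketch of one per-sample inference call (placeholder op names and shapes). */
void run_single_inference(TF_Session* session, TF_Graph* graph,
                          const float sample[3], float result[15]) {
  TF_Status* status = TF_NewStatus();

  /* Look up the input and output operations by (placeholder) name. */
  TF_Output input_op  = {TF_GraphOperationByName(graph, "input"),  0};
  TF_Output output_op = {TF_GraphOperationByName(graph, "output"), 0};

  /* One sample: a [1, 3] float tensor holding the 3 input values. */
  const int64_t in_dims[2] = {1, 3};
  TF_Tensor* in_tensor = TF_AllocateTensor(TF_FLOAT, in_dims, 2, 3 * sizeof(float));
  memcpy(TF_TensorData(in_tensor), sample, 3 * sizeof(float));

  TF_Tensor* out_tensor = NULL;

  /* run_options, target ops and run_metadata are left unset, as in most examples. */
  TF_SessionRun(session, /*run_options=*/NULL,
                &input_op, &in_tensor, 1,
                &output_op, &out_tensor, 1,
                /*target_opers=*/NULL, /*ntargets=*/0,
                /*run_metadata=*/NULL, status);

  if (TF_GetCode(status) == TF_OK) {
    memcpy(result, TF_TensorData(out_tensor), 15 * sizeof(float));  /* [1, 15] output */
  } else {
    fprintf(stderr, "TF_SessionRun failed: %s\n", TF_Message(status));
  }

  TF_DeleteTensor(in_tensor);
  if (out_tensor != NULL) TF_DeleteTensor(out_tensor);
  TF_DeleteStatus(status);
}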
The problem now is that I need to do the inference 34280 times in every timestep, which takes approximately 52 minutes (34280 x 91 ms). For thousands of timesteps, the total computation time therefore becomes prohibitive.
Surprisingly, if I convert the frozen graph to a UFF model and do the inference with TensorRT, it only takes 90 milliseconds for all 34280 input sets, i.e. a speed-up of roughly 35000x over my C API setup. However, since we want to run the inference on a CPU-only architecture later on, TensorRT is not an option for me.
My question: is there a way to use TensorFlow's C API such that the computation time for many inferences drops drastically? The bottleneck is clearly the TF_SessionRun(...) call, but I cannot see a way to run 34280 inferences with only a single call. Moreover, the function takes several arguments (run options, run metadata, target operations, number of targets; see the signature above) that aren't used in any of the examples I found online. Maybe these can be used to improve performance?
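To make the question concrete, this is the kind of batched call I would hope is possible. It assumes the frozen graph's input placeholder accepts an undefined batch dimension (shape [None, 3]), and the op names are again placeholders:

#include <stdint.h>
#include <string.h>
#include <tensorflow/c/c_api.h>

/* Batched variant: all n input sets in one [N, 3] tensor, one TF_SessionRun
   per timestep, one [N, 15] output. Assumes a variable batch dimension. */
int run_batched_inference(TF_Session* session, TF_Graph* graph,
                          const float* samples,  /* N * 3 values, row-major  */
                          float* results,        /* N * 15 values, row-major */
                          int64_t n) {
  TF_Status* status = TF_NewStatus();

  TF_Output input_op  = {TF_GraphOperationByName(graph, "input"),  0};
  TF_Output output_op = {TF_GraphOperationByName(graph, "output"), 0};

  /* All n input sets of the current timestep packed into a single tensor. */
  const int64_t in_dims[2] = {n, 3};
  TF_Tensor* in_tensor =
      TF_AllocateTensor(TF_FLOAT, in_dims, 2, (size_t)n * 3 * sizeof(float));
  memcpy(TF_TensorData(in_tensor), samples, (size_t)n * 3 * sizeof(float));

  TF_Tensor* out_tensor = NULL;

  /* One call instead of n calls, so the per-call overhead is paid only once. */
  TF_SessionRun(session, NULL,
                &input_op, &in_tensor, 1,
                &output_op, &out_tensor, 1,
                NULL, 0, NULL, status);

  int ok = (TF_GetCode(status) == TF_OK);
  if (ok) {
    /* Expected output shape: [N, 15]. */
    memcpy(results, TF_TensorData(out_tensor), (size_t)n * 15 * sizeof(float));
  }

  TF_DeleteTensor(in_tensor);
  if (out_tensor != NULL) TF_DeleteTensor(out_tensor);
  TF_DeleteStatus(status);
  return ok;
}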