tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Running session using c++ api is significantly slower than using python #10669

Closed July-Morning closed 7 years ago

July-Morning commented 7 years ago

System information

No docker, no virtual environment. All tests were done using CPU only. All optimization flags are set, no warnings are shown. tcmalloc is also used. Batching also helps to increase the performance both for c++ and python, but the gap stays the same.

Describe the problem

I have written a simple benchmark based on the official Deep MNIST example (https://www.tensorflow.org/get_started/mnist/pros). I create a simple convolutional net, train it on MNIST (the number of training steps is small, as we are interested in speed, not accuracy), freeze the graph and load it into C++. Then I run the tests in Python and in C++ and measure the average time it takes to run a TensorFlow session. My tests show that running the session in Python takes ~2 ms, while doing the same with the C++ API is slower: ~3 ms. The code of the benchmark can be found here: https://github.com/July-Morning/MNIST_convnet_tensorflow It should also be mentioned that the same tests were done for a multilayer perceptron. In that case the C++ API was significantly (~7-10x) faster than Python, just as expected.

aselle commented 7 years ago

Are you sure you build them the same way? I.e. were you running the python from a whl created from the same compilation you did for the C++ example? If you didn't run from the same, you should make sure you had optimization enabled for your c++ build https://stackoverflow.com/questions/27086145/what-is-the-default-build-configuration-of-cmake It looks like not since you didn't do cmake -DCMAKE_BUILD_TYPE=Release (or whatever is the right thing for cmake). Let me know if this helps. Thanks.

July-Morning commented 7 years ago

Thank you for the quick response!

Well, I tried adding -DCMAKE_BUILD_TYPE=Release, but the results are the same. For c++ I am using tensorflow headers built by bazel with all optimizations.

aselle commented 7 years ago

@alextp and @skye, thought you'd be interested in this. Please redirect if others are better able to handle this.

alextp commented 7 years ago

Is this related to using different clocks in the python and C++ code?

July-Morning commented 7 years ago

Any idea how to compare running times in a better way? Tried measuring time in Python using timeit.default_timer(), got the same results.

In fact, the reason I created this issue is as follows: after I trained my own net (more complicated than the one in the benchmark) and started to test it, I got real-time performance in Python but not in C++, and the bottleneck was running the TensorFlow session.

alextp commented 7 years ago

I am more worried about the C++ usage of clock, which I don't know how to interpret in multicore settings. Why not use std::chrono::system_clock::now() which should track real time reasonably well?
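For readers following along: C's `clock()` reports processor time, and on Linux that is summed over all threads of the process, so a session that keeps several cores busy can look slower than it really is. The sketch below shows the same two kinds of timer in Python. `measure` is a hypothetical helper, not something from the benchmark repo, and `time.sleep` stands in for the workload only because it makes the two readings visibly differ.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) for a single call to fn()."""
    wall_start = time.perf_counter()   # monotonic wall-clock timer
    cpu_start = time.process_time()    # CPU time of this process, all threads
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


if __name__ == "__main__":
    # Sleeping consumes wall-clock time but almost no CPU time, so the two
    # timers disagree. A multithreaded workload skews the other way.
    wall, cpu = measure(lambda: time.sleep(0.2))
    print(f"wall: {wall * 1e3:.1f} ms, cpu: {cpu * 1e3:.1f} ms")
```

When comparing Python and C++ timings, both sides should use a wall-clock timer (for example `time.perf_counter()` in Python and a `std::chrono` clock in C++).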

On Wed, Jun 14, 2017 at 1:26 AM, July-Morning notifications@github.com wrote:

Any idea how to compare running times in a better way? Tried measuring time in Python using timeit.default_timer(), got the same results.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tensorflow/tensorflow/issues/10669#issuecomment-308359389, or mute the thread https://github.com/notifications/unsubscribe-auth/AAATxTS2sfTHa567tMFK30WYl1lKd5aPks5sD5k4gaJpZM4N4KAA .

--

July-Morning commented 7 years ago

@alextp I did time measurements with std::chrono::system_clock::now() and got the same results.

July-Morning commented 7 years ago

@alextp Oh, no, sorry, my mistake. Thank you a lot. For the benchmark I posted this solved the issue, but my original net is still slower in C++. Further investigation needed. Will close the issue for now and reopen if necessary. Is that OK?

alextp commented 7 years ago

Sure, thanks!

July-Morning commented 7 years ago

For quite some time I was busy with another project, but I have finally got back to this issue. I found out that the cause was not moving from Python to C++ but freezing the graph. Even when I load the frozen graph into Python code, my model runs 2-3 times slower. The same problem was reported here: https://github.com/tensorflow/tensorflow/issues/3216 The question now is: is it possible to load a TensorFlow model into C++ without freezing the graph? Or is it possible to speed up running a session on a frozen graph?

cPython prof files are attached in case they are useful: prof.zip (https://github.com/tensorflow/tensorflow/files/1173861/prof.zip). Only tensorflow session.run was profiled. demo_session.prof is for the restore-from-checkpoint case, demo_graph.prof is for the frozen-graph case.

alextp commented 7 years ago

Ah, I see. Are you using GPU variables? Maybe the graph freezing code isn't placing them correctly.

July-Morning commented 7 years ago

I have no GPU on my computer (and tensorflow is built in cpu-only mode). And I tried to remove manually all the parts mentioning GPU from graph.pb - didn't help at all.

alextp commented 7 years ago

Is it feasible for you to not freeze your graph, then?

alextp commented 7 years ago

Adding @petewarden as he might have a better understanding than I do about frozen graphs

July-Morning commented 7 years ago

I need to use my trained model in C++ project. I couldn't find a way to load it without freezing. If it is possible, could you please give me a clue how to do that?

yaroslavvb commented 7 years ago

Graph freezing turns all parameters into constants. Historically, constants have been less performant than variables (i.e., they are stored in the graph data structure, which has additional locking).

July-Morning commented 7 years ago

Got it, but is there any way to use the model I have in a C++ project without a 2-3x loss in performance? Maybe it is possible to load the model without freezing it? Any example or workaround would be really useful, because right now I get real-time performance when I restore the model in Python and can't achieve anything close in C++.

yaroslavvb commented 7 years ago

How about SavedModel?

July-Morning commented 7 years ago

Thank you! I will try SavedModel and report the results here.

petewarden commented 7 years ago

To expand on Yaroslav's comments, there are some potential initialization overheads due to the way we copy GraphDefs, but we don't expect performance to be noticeably slower on real models (and can actually be faster if you use memmapping).

With that said, using MNIST as a benchmark probably isn't very useful, since the amount of computation involved is very small and so initialization and other overheads will predominate. Have you looked at https://www.tensorflow.org/performance/benchmarks for some more representative models?

July-Morning commented 7 years ago

I am not using the MNIST benchmark anymore; it was only provided as an example here. What I am working with is a more complicated net (a modified SqueezeNet on an 800x450 image), so I don't think initialization is the main issue here. And I am comparing only the performance of session.run itself over a set of consecutive runs in a loop. It seems that memmapping and SavedModel are the next steps for me.

Again, thank you all for incredibly helpful and quick responses!

July-Morning commented 7 years ago

More questions than answers for now:

1) I tried memmapping the frozen graph with convert_graphdef_memmapped_format, but when I try to load the memmapped frozen graph I get parse/read errors both in C++ (ReadBinaryProto) and in Python (import_graph_def). Should I do the memmapping before or after freezing the graph? I haven't tried the second option because my initial graph.pb is in text, not binary, format.

2) Is there a working example of saving a model as a SavedModel? I found these (https://github.com/tensorflow/serving/tree/master/tensorflow_serving/example), but it is not really clear what to do with a FasterRCNN-like architecture (which is my case), where one also has bounding boxes as output and may have more than one input. Specifically, I am confused by tf.saved_model.signature_constants.

alextp commented 7 years ago

Re (2): you can use a custom signature for your model instead of the standard classification/regression ones. You can make a SignatureDef by filling out the proto in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/meta_graph.proto#L293 and pass it in the signature_def_map of your SavedModel, as explained in https://tensorflow.github.io/serving/serving_basic.html

July-Morning commented 7 years ago

Great, I will try! Thank you!

July-Morning commented 7 years ago

Well, I managed to save my model in SavedModel format and load it back with Python, but running time is as slow as when using a frozen graph.

Here is how I save my model:

```python
from datetime import datetime

import tensorflow as tf

# `sess` (the session holding the trained variables) and `model` come from the
# training script and are not shown here.
export_path = './saved_model' + str(datetime.now()) + '/'
builder = tf.saved_model.builder.SavedModelBuilder(export_path)

# Describe the input and output tensors that the signature exposes.
tensor_info_im = tf.saved_model.utils.build_tensor_info(model.image_input)
tensor_info_pr = tf.saved_model.utils.build_tensor_info(model.keep_prob)
tensor_info_bb = tf.saved_model.utils.build_tensor_info(model.det_boxes)
tensor_info_sc = tf.saved_model.utils.build_tensor_info(model.det_probs)

prediction_signature = (
    tf.saved_model.signature_def_utils.build_signature_def(
        inputs={'image_input:0': tensor_info_im, 'keep_prob:0': tensor_info_pr},
        outputs={'bbox/trimming/bbox:0': tensor_info_bb, 'probability/score:0': tensor_info_sc},
        method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME))
builder.add_meta_graph_and_variables(
    sess, [tf.saved_model.tag_constants.SERVING],
    signature_def_map={'predict': prediction_signature}, clear_devices=True)

builder.save()
```

And this is how I load it: `tf.saved_model.loader.load(sess, ["serve"], export_dir)`

Is something wrong here?

Something else that I noticed: when I am running a restored tf session in a loop, first two runs are approximately as slow as with frozen graph or SavedModel. However after that it gets much (2x-3x) faster.

alextp commented 7 years ago

In general the first run of a tensorflow session is slower as a lot of once-only computation gets performed (graph pruning, GPU set up, etc)
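Because of that one-off cost, a fair comparison between a restored model, a frozen graph and a SavedModel should discard the first few runs before timing. Below is a minimal sketch of such a harness. `benchmark` and its parameters are hypothetical names, and the lambda is a stand-in workload; in this thread `run_once` would be a closure around `sess.run`.

```python
import statistics
import time


def benchmark(run_once, warmup=5, iters=50):
    """Time run_once() after discarding warm-up calls; return per-call seconds."""
    for _ in range(warmup):
        run_once()                      # not timed: one-off setup happens here
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return samples


if __name__ == "__main__":
    # Stand-in workload (assumption); replace with the real session call.
    samples = benchmark(lambda: sum(i * i for i in range(10_000)))
    print(f"median: {statistics.median(samples) * 1e3:.3f} ms over {len(samples)} runs")
```

Reporting the median rather than the mean keeps a single slow outlier from dominating the result.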

July-Morning commented 7 years ago

Yeah, I thought about it, too, but decided to mention just in case.

July-Morning commented 7 years ago

Any ideas? I have uploaded my scripts and model files to GitHub: https://github.com/July-Morning/Tensorflow_model_utils For this specific model, restore_model.py runs 2 times faster than load_frozen_graph.py or load_savedmodel.py. Maybe that will help somehow...

July-Morning commented 7 years ago

I have also run a test on TensorFlow built with GPU support (running all the scripts above on a computer with a TitanX). Of course everything is faster there, but the difference still exists (restoring the model is ~2x faster than loading the frozen graph or the SavedModel), so it does not seem to be an operating system or a specific build issue.

yaroslavvb commented 7 years ago

This issue is a bit confusing to follow, is the problem with C++ API, or is the issue with SavedModel?

July-Morning commented 7 years ago

Oh, I see. I will try to sum up briefly how we got here and where we are now.

  1. I noticed that when I loaded a frozen graph with the C++ API, the TF session ran twice as slow as in Python. I created an MNIST benchmark as an example, but @alextp showed that there was a stupid mistake in my time measurement. So for MNIST everything was fine (for my original, more complicated net it was not), and I closed the issue for a while.

From this moment on the MNIST benchmark was not used; all experiments were done on another net.

  2. When I came back to this problem, I noticed that I get the same slowdown even if I load the frozen graph with the standard Python API. Thus, the issue did not seem to be in the C++ API.

From this moment on I was using Python only!

  3. @yaroslavvb advised me to try SavedModel instead of freezing the graph, and I did (not sure if perfectly correctly, but it worked) and got the same slowdown.

  4. I ran my scripts on another computer (with a GPU) and got the same difference in time, so it does not seem to be an operating system or a specific build issue.

So now the problem is as follows: I have real-time performance when I am using restored model, but I get a 2-3x slowdown when I am trying to load either frozen graph or SavedModel. The question is: why (or what am I doing wrong) and how to get rid of this effect? Model files and python scripts can be found here: https://github.com/July-Morning/Tensorflow_model_utils

Please let me know if it would be better to open another issue after all this mess.

July-Morning commented 7 years ago

Anything else I can do to make things more clear? Any tests or profiling? I would be really grateful for any tiny chance to solve that issue.

yaroslavvb commented 7 years ago

I cloned your repo and tried to run it and get

  File "load_savedmodel.py", line 41
    im = np.random.rand(IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_DEPTH) 
                                                               ^
TabError: inconsistent use of tabs and spaces in indentation

The best bug reports are the ones that provide a line number of where the bug is :) If I were you, I would first try reducing this to the simplest possible example (i.e., a single variable, a single op), maybe combined with profiling the code (TF timeline/snakeviz) to see where the slowness is. If you find the source of the problem, you could use git blame to find who added that line to the code, and cc them on the bug.
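As a rough illustration of the profiling step, here is a minimal sketch using Python's standard `cProfile` and `pstats` modules. `profile_call` and `workload` are hypothetical names; `workload` is only a placeholder for the call under investigation (for example one `sess.run` step).

```python
import cProfile
import io
import pstats


def profile_call(fn, top=10):
    """Run fn() under cProfile and return a text report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        fn()
    finally:
        profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


def workload():
    # Placeholder for the code being profiled (assumption, not from the repo).
    return sorted(range(100_000), key=lambda x: -x)


if __name__ == "__main__":
    print(profile_call(workload))
```

To inspect the result in snakeviz instead of as text, the same profiler object can write a file with `profiler.dump_stats("run.prof")`.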

July-Morning commented 7 years ago

Thank you for the advice. I will do my best to follow it and will definitely fix code in the repo :)

By the way, I have attached some profiling results with snakeviz earlier. Will do it again here: prof.zip

July-Morning commented 7 years ago

I have updated my repo, hope now everything is alright with the indentations. https://github.com/July-Morning/Tensorflow_model_utils Profiling is also done here, it shows main difference in _pywrap_tensorflow_internal.TF_Run and session.py(_run)

July-Morning commented 7 years ago

Minor update here: the slowdown seems to happen only in case of cpu version of tensorflow and in case of gpu version of tensorflow running on several GPUs. In case of tensorflow-gpu running on one GPU the times for all three cases are nearly equal.

July-Morning commented 7 years ago

Update: if I do training and evaluation with the same batch size (equal to 1), the slowdown disappears.

July-Morning commented 7 years ago

Finally found the reason for the problem: it was the variable batch size.

Thank you all, I'm closing the issue!

brezhnyev commented 6 years ago

Hi, guys. One thing I cannot understand: am I missing the frozen_graph.pb needed to successfully start the program? Thanks a lot for an answer.

ydp commented 6 years ago

@July-Morning Hi, I see you finally solved the problem and the reason was the variable batch size. How did you solve this in your production environment? Do you mean the batch size for training and evaluation has to be the same? In my situation, we often train with a large batch size like 4096 but run inference with a small batch like 20, so I would definitely suffer from this problem. Is that the situation here?

Thanks very much.

abhigoku10 commented 6 years ago

@July-Morning @yaroslavvb Hi guys, I have a model which was trained on GPU. With Python code this model takes approx. 4 sec on a Linux CPU system; when I use the same model on the CPU I get timings of 11-16 sec. Can you point out why I get such a large timing difference and how I can solve it?

Thanks in advance

yaroslavvb commented 6 years ago

All fast programs are alike, every slow program is slow in its own way --Tolstoy

abhigoku10 commented 6 years ago

@yaroslavvb Sorry, I did not get you. Are you quoting Tolstoy for me or giving me a suggestion?

lovychen commented 5 years ago

@July-Morning @yaroslavvb Hi guys i have a model which was trained on gpu . This model with python code takes approx 4 sec on linux cpu system , when i use the same model on the cpu i get the timing of 11-16 sec can you point out the problem y i get this much timing difference and how can i solve this

I have the same problem and have not solved it.

sathyarr commented 5 years ago

Adding @petewarden as he might have a better understanding than I do about frozen graphs

@petewarden @yaroslavvb Which will have good performance or speed improvements for Inference(only) with C++?

  1. Loading a frozen model
  2. Loading a graph/Checkpoints

Or not C++ at all? will loading with Python be actually faster? or it depends on training parameters if any?

tangjie77wd commented 5 years ago

Update: if I do training and evaluation with the same batch size (equal to 1), the slowdown disappears.

Do you mean that the inference slowdown in TensorFlow C++ disappears once you use a model that was trained and evaluated in Python with the same batch size (equal to 1)? I have the same problem as you (running a session using the C++ API is significantly slower than using Python), and I have tried several solutions:

  1. Compiling the TensorFlow C++ shared library with optimization flags: AVX/AVX2/SSE4.1/SSE4.2/FMA/XLA
  2. MKL-DNN
  3. Replacing single-image inference in TensorFlow C++ with batch inference: https://stackoverflow.com/questions/57460782/batch-inference-is-as-slow-as-single-image-inference-in-tensorflow-c

None of these helped.