Are you sure you built them the same way? I.e., were you running the Python code from a whl created from the same compilation as the C++ example? If not, you should make sure optimization was enabled for your C++ build (see https://stackoverflow.com/questions/27086145/what-is-the-default-build-configuration-of-cmake). It looks like it wasn't, since you didn't pass cmake -DCMAKE_BUILD_TYPE=Release (or whatever the right flag is for cmake). Let me know if this helps. Thanks.
Thank you for the quick response!
Well, I tried adding -DCMAKE_BUILD_TYPE=Release, but the results are the same. For C++ I am using TensorFlow headers built by Bazel with all optimizations.
@alextp and @skye, thought you'd be interested in this. Please redirect if others are better able to handle this.
Is this related to using different clocks in the python and C++ code?
Any idea how to compare running times in a better way? Tried measuring time in Python using timeit.default_timer(), got the same results.
In fact, the reason I created this issue is as follows: after I trained my own net (more complicated than the one in the benchmark) and started to test it, I got real-time performance in Python but not in C++, and the bottleneck was running the TensorFlow session.
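(For reference, a minimal sketch of the kind of Python-side timing described above, assuming an already-created session `sess`; the tensor names, input shape, and run counts are illustrative, not taken from the benchmark.)

```python
import numpy as np
from timeit import default_timer

# Illustrative input and tensor names; in a real benchmark these come from
# the loaded graph.
image = np.random.rand(1, 28, 28, 1).astype(np.float32)
feed = {'input:0': image}
fetch = 'output:0'

# Warm-up runs: the first few session.run calls include one-off setup and
# should not be counted.
for _ in range(5):
    sess.run(fetch, feed_dict=feed)

n_runs = 100
start = default_timer()
for _ in range(n_runs):
    sess.run(fetch, feed_dict=feed)
elapsed = default_timer() - start
print('average session.run time: %.3f ms' % (1000.0 * elapsed / n_runs))
```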
I am more worried about the C++ usage of clock, which I don't know how to interpret in multicore settings. Why not use std::chrono::system_clock::now() which should track real time reasonably well?
@alextp I did time measurements with std::chrono::system_clock::now() and got the same results.
@alextp Oh, no, sorry, my mistake. Thank you a lot. For the benchmark I posted this solved the issue, but my original net is still slower in C++. Further investigation needed. Will close the issue for now and reopen if necessary. Is that OK?
Sure, thanks!
For quite a while I was busy with another project, but I finally got back to this issue. I found out that the reason was not moving from Python to C++ but freezing the graph: even when I load the frozen graph in Python, my model runs 2-3 times slower. The same problem was reported here: https://github.com/tensorflow/tensorflow/issues/3216. The question now is: is it possible to load a TensorFlow model into C++ without freezing the graph? Or is it possible to speed up running a session on a frozen graph?
Python .prof files are attached in case they are useful: prof.zip (https://github.com/tensorflow/tensorflow/files/1173861/prof.zip). Only tensorflow session.run was profiled: demo_session.prof is for the restore-from-checkpoint case, demo_graph.prof is for the load-frozen-graph case.
Ah, I see. Are you using GPU variables? Maybe the graph freezing code isn't placing them correctly.
I have no GPU on my computer (and TensorFlow is built in CPU-only mode). I also tried manually removing all the parts mentioning GPU from graph.pb; it didn't help at all.
Is it feasible for you to not freeze your graph, then?
Adding @petewarden as he might have a better understanding than I do about frozen graphs
I need to use my trained model in C++ project. I couldn't find a way to load it without freezing. If it is possible, could you please give me a clue how to do that?
Graph freezing turns all parameters into constants. Historically, constants have been less performant than variables (i.e., they are stored in the graph data structure, which has additional locking).
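(For context, freezing is typically done with tf.graph_util.convert_variables_to_constants, which is what the freeze_graph tool uses under the hood. A minimal sketch, assuming a live session `sess` holding the trained variables; the output node names are illustrative and mirror the ones used later in this thread.)

```python
import tensorflow as tf

# Illustrative output node names; `sess` is assumed to hold the trained model.
output_node_names = ['bbox/trimming/bbox', 'probability/score']

# Every tf.Variable reachable from the output nodes is replaced by a Const node.
frozen_graph_def = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph.as_graph_def(), output_node_names)

with tf.gfile.GFile('frozen_graph.pb', 'wb') as f:
    f.write(frozen_graph_def.SerializeToString())
```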
Got it, but is there any way to use the model I have in a C++ project without a 2-3x loss in performance? Maybe it is possible to load the model without freezing it? Any example or workaround would be really useful, because right now I get real-time performance when I restore the model in Python and can't achieve anything close in C++.
How about SavedModel?
Thank you! I will try SavedModel and report the results here.
To expand on Yaroslav's comments, there are some potential initialization overheads due to the way we copy GraphDefs, but we don't expect performance to be noticeably slower on real models (and it can actually be faster if you use memmapping).
With that said, using MNIST as a benchmark probably isn't very useful, since the amount of computation involved is very small and so initialization and other overheads will predominate. Have you looked at https://www.tensorflow.org/performance/benchmarks for some more representative models?
I am not using the MNIST benchmark anymore; it was only provided as an example here. What I am working with is a more complicated net (a modified SqueezeNet on an 800x450 image), so I don't think initialization is the main issue here. And I am comparing only the performance of session.run itself over a set of consecutive runs in a loop. It seems that memmapping and SavedModel are the next steps for me.
Again, thank you all for incredibly helpful and quick responses!
More questions than answers for now:
1) I tried memmapping the frozen graph with convert_graphdef_memmapped_format, but when trying to load the mapped frozen graph I get parse/reading errors both in C++ (ReadBinaryProto) and Python (import_graph_def). Should I do memmapping before or after freezing the graph? I haven't tried the second option because I have the initial graph.pb in text, not binary, format.
2) Is there a working example of saving a model into SavedModel? I found these examples (https://github.com/tensorflow/serving/tree/master/tensorflow_serving/example), but it is not really clear what to do with a FasterRCNN-like architecture (which is my case), where one also has bounding boxes as output and may have more than one input. Specifically, I am confused by tf.saved_model.signature_constants.
Re (2), you can use a custom signature for your model instead of the standard classification/regression ones. You can make a SignatureDef by filling out the proto in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/meta_graph.proto#L293 and passing it in the signature_def_map of your SavedModel, as explained in https://tensorflow.github.io/serving/serving_basic.html
Great, I will try! Thank you!
Well, I managed to save my model in SavedModel format and load it back with Python, but running time is as slow as when using a frozen graph.
Here is how I save my model:

```python
from datetime import datetime
import tensorflow as tf

# `sess` is the active session and `model` holds the input/output tensors.
export_path = './saved_model' + str(datetime.now()) + '/'
builder = tf.saved_model.builder.SavedModelBuilder(export_path)

tensor_info_im = tf.saved_model.utils.build_tensor_info(model.image_input)
tensor_info_pr = tf.saved_model.utils.build_tensor_info(model.keep_prob)
tensor_info_bb = tf.saved_model.utils.build_tensor_info(model.det_boxes)
tensor_info_sc = tf.saved_model.utils.build_tensor_info(model.det_probs)

prediction_signature = (
    tf.saved_model.signature_def_utils.build_signature_def(
        inputs={'image_input:0': tensor_info_im, 'keep_prob:0': tensor_info_pr},
        outputs={'bbox/trimming/bbox:0': tensor_info_bb, 'probability/score:0': tensor_info_sc},
        method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME))

builder.add_meta_graph_and_variables(
    sess, [tf.saved_model.tag_constants.SERVING],
    signature_def_map={'predict': prediction_signature}, clear_devices=True)

builder.save()
```
And this is how I load it:

```python
tf.saved_model.loader.load(sess, ["serve"], export_dir)
```
Is something wrong here?
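(For completeness, a minimal sketch of how the loaded 'predict' signature above might then be used for inference; the image shape and keep_prob value are illustrative, not taken from the actual model.)

```python
import numpy as np
import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    meta_graph = tf.saved_model.loader.load(sess, ["serve"], export_dir)
    signature = meta_graph.signature_def['predict']

    # Resolve the real tensor names from the signature's TensorInfo entries.
    image_name = signature.inputs['image_input:0'].name
    keep_prob_name = signature.inputs['keep_prob:0'].name
    boxes_name = signature.outputs['bbox/trimming/bbox:0'].name
    scores_name = signature.outputs['probability/score:0'].name

    # Illustrative single image (height 450, width 800, 3 channels).
    image = np.random.rand(1, 450, 800, 3).astype(np.float32)
    boxes, scores = sess.run(
        [boxes_name, scores_name],
        feed_dict={image_name: image, keep_prob_name: 1.0})
```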
Something else that I noticed: when I run a restored TF session in a loop, the first two runs are approximately as slow as with the frozen graph or SavedModel. However, after that it gets much (2-3x) faster.
In general the first run of a TensorFlow session is slower, as a lot of once-only computation gets performed (graph pruning, GPU setup, etc.).
Yeah, I thought about that too, but decided to mention it just in case.
Any ideas? I have uploaded my scripts and model files to GitHub: https://github.com/July-Morning/Tensorflow_model_utils. For this specific model, restore_model.py runs 2 times faster than load_frozen_graph.py or load_savedmodel.py. Maybe that will help somehow...
I have also run a test on TensorFlow built with GPU support (ran all the scripts above on a computer with a Titan X). Of course everything is faster there, but the difference still exists (restoring the model is ~2x faster than loading the frozen graph or SavedModel), so it does not seem to be an operating-system or build-specific issue.
This issue is a bit confusing to follow, is the problem with C++ API, or is the issue with SavedModel?
Oh, I see. I will try to sum up briefly how we got here.
From this moment on, the MNIST benchmark was not used; all experiments were done on another net.
From this moment on, I was using Python only!
@yaroslavvb advised me to try using SavedModel instead of freezing the graph, and I did (not sure if perfectly correctly, but it worked) and got the same slowdown.
I have run my scripts on another computer (with a GPU) and got the same difference in time, so it does not seem to be an operating-system or build-specific issue.
So now the problem is as follows: I get real-time performance when I use a restored model, but a 2-3x slowdown when I load either the frozen graph or the SavedModel. The question is: why (or what am I doing wrong), and how do I get rid of this effect? Model files and Python scripts can be found here: https://github.com/July-Morning/Tensorflow_model_utils
Please let me know if it would be better to open another issue after all this mess.
Anything else I can do to make things more clear? Any tests or profiling? I would be really grateful for any tiny chance to solve that issue.
I cloned your repo and tried to run it, and got:

```
  File "load_savedmodel.py", line 41
    im = np.random.rand(IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_DEPTH)
    ^
TabError: inconsistent use of tabs and spaces in indentation
```
The best bug reports are the ones that provide the line number of where the bug is :) If I were you, I would first try reducing to the simplest possible example (i.e., a single variable, a single op), maybe combined with profiling the code (TF timeline/snakeviz) to see where the slowness is. If you find the source of the problem, you could use git blame to find who added that line to the code and cc them on the bug.
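(A minimal sketch of capturing a TF timeline for a single session.run call, assuming a session `sess`, some `fetches`, and a `feed` dict; the resulting JSON can be opened in chrome://tracing.)

```python
import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# Trace one representative run (not the first one, which includes one-off setup).
sess.run(fetches, feed_dict=feed,
         options=run_options, run_metadata=run_metadata)

# Convert the collected step stats into a Chrome trace file.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())
```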
Thank you for the advice. I will do my best to follow it and will definitely fix the code in the repo :)
By the way, I attached some profiling results from snakeviz earlier. I will attach them again here: prof.zip
I have updated my repo; I hope everything is alright with the indentation now: https://github.com/July-Morning/Tensorflow_model_utils. Profiling is also done there; it shows the main difference is in _pywrap_tensorflow_internal.TF_Run and session.py(_run).
Minor update here: the slowdown seems to happen only with the CPU version of TensorFlow and with the GPU version of TensorFlow running on several GPUs. With tensorflow-gpu running on a single GPU, the times for all three cases are nearly equal.
Update: if I do training and evaluation with the same batch size (equal to 1), the slowdown disappears.
Finally found the reason for the problem: it was the variable batch size.
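(For readers landing here later, a minimal sketch of what "variable batch size" means at the graph level; the shapes below are illustrative, not taken from the actual model.)

```python
import tensorflow as tf

# Variable batch dimension: the first dimension is None, so the graph has to
# handle arbitrary batch sizes. This is the setup that reportedly caused the
# slowdown after freezing / exporting.
variable_batch_input = tf.placeholder(tf.float32, [None, 450, 800, 3])

# Fixed batch dimension (1, matching single-image inference). Training and
# evaluating with this shape reportedly removed the slowdown.
fixed_batch_input = tf.placeholder(tf.float32, [1, 450, 800, 3])
```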
Thank you all, I'm closing the issue!
Hi guys, one thing I cannot understand: am I missing the frozen_graph.pb needed to successfully start the program? Thanks a lot for any answer.
@July-Morning Hi, I see you finally solved the problem; the reason was the variable batch size. How did you solve this in your production environment? Do you mean the batch size for training and evaluation has to be the same? In my situation we often train with a large batch size like 4096 but run inference with a small batch like 20, so I would definitely suffer from this problem. Is this the situation here?
Thanks very much.
@July-Morning @yaroslavvb Hi guys, I have a model which was trained on GPU. With Python code this model takes approx. 4 sec on a Linux CPU system; when I use the same model on the CPU I get timings of 11-16 sec. Can you point out why I get such a large timing difference and how I can solve it?
Thanks in advance
All fast programs are alike, every slow program is slow in its own way --Tolstoy
@yaroslavvb Sorry, I did not get you. Are you quoting Tolstoy at me or giving me a suggestion?
> @July-Morning @yaroslavvb Hi guys, I have a model which was trained on GPU. With Python code this model takes approx. 4 sec on a Linux CPU system; when I use the same model on the CPU I get timings of 11-16 sec. Can you point out why I get such a large timing difference and how I can solve it?
I have the same problem and have not solved it.
> Adding @petewarden as he might have a better understanding than I do about frozen graphs
@petewarden @yaroslavvb Which approach will give better performance or speed for inference (only) with C++?
Or is it not C++ at all? Will loading with Python actually be faster, or does it depend on training parameters, if any?
> Update: if I do training and evaluation with the same batch size (equal to 1), the slowdown disappears.
Do you mean that the inference slowdown in TensorFlow C++ disappears after you use a model that was trained and evaluated in Python with the same batch size (equal to 1)? I have the same problem as you (running a session using the C++ API is significantly slower than using Python) and I have tried many solutions, as the following shows:
1. compiling the TensorFlow C++ shared library with optimization flags: AVX/AVX2/SSE4.1/SSE4.2/FMA/XLA
2. MKL-DNN
3. replacing single-image inference in TensorFlow C++ with batch inference (https://stackoverflow.com/questions/57460782/batch-inference-is-as-slow-as-single-image-inference-in-tensorflow-c)
None of these helped me.
System information
No docker, no virtual environment. All tests were done using CPU only. All optimization flags are set and no warnings are shown. tcmalloc is also used. Batching also helps to increase performance both for C++ and Python, but the gap stays the same.
Describe the problem
I have written a simple benchmark based on the official Deep MNIST example (https://www.tensorflow.org/get_started/mnist/pros). I create a simple convolutional net, train it on MNIST (the number of training steps is small, as we are interested in speed, not accuracy), freeze the graph and load it into C++. Then I run tests using Python and using C++ and measure the average time it takes to run a TensorFlow session. My tests show that running the session in Python takes ~2 ms, while doing the same using the C++ API is slower: ~3 ms. The code of the benchmark can be found here: https://github.com/July-Morning/MNIST_convnet_tensorflow It should also be mentioned that the same tests were done for a multilayer perceptron; in that case the C++ API was significantly (~7-10x) faster than Python, just as expected.