tensorflow / serving

A flexible, high-performance serving system for machine learning models
https://www.tensorflow.org/serving
Apache License 2.0

Tensorflow Serving runs stock Xception models very slowly. #1564

Closed · patha454 closed this issue 1 year ago

patha454 commented 4 years ago

Bug Report

System information

Describe the problem

We've been attempting to deploy a machine learning solution with Tensorflow Serving on an embedded device (Jetson Xavier [ARMv8]).

One model used by the solution is a stock Xception network from tf.keras.applications. Running the Xception model on the device in vanilla Tensorflow gives reasonable performance - about 0.1s per prediction, ignoring all preprocessing. We get similar results running the Xception model on an RTX2080 server and an x86 laptop, with the model running at reasonable speeds in Tensorflow - between 1 second per CPU prediction and 0.01s per GPU prediction, depending on the hardware capabilities.

However, once the model is running in a Tensorflow Serving GPU container, via Nvidia-Docker, the model is much slower - about 3s to predict.

I've been trying to isolate the cause of the poor performance; so far I've tested:

I would expect TF Serving to provide the same prediction time as TF (more or less, allowing for GRPC encoding and decoding), and it does for other models I'm running. None of my efforts have gotten Xception up to the ~0.1s performance I would expect on the embedded device. In light of our tests on the server and laptop with pre-built binaries, we're now confident this is a problem with Tensorflow Serving itself, and not with our infrastructure or build.

Another team member opened a ticket on Stack Overflow before we were certain the problem was in Tensorflow Serving. This bug report is based on the Stack Overflow ticket: [https://stackoverflow.com/questions/60482185/why-would-tensorflow-serving-run-a-model-slower-than-tensorflow-on-a-keras-stock]

Exact Steps to Reproduce

The stock Xception model was generated from Tensorflow, with:

import tensorflow as tf

xception = tf.keras.applications.Xception(include_top=False, input_shape=(299, 299, 3), pooling=None)
xception.save("./saved_xception_model/1", save_format="tf")
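
If it helps, the serving signature that TF Serving will expose for this export - and therefore the input tensor name a client must use - can be inspected from Python. A minimal sketch, assuming the default "serving_default" signature produced by the export above:

import tensorflow as tf

loaded = tf.saved_model.load("./saved_xception_model/1")
sig = loaded.signatures["serving_default"]
print(sig.structured_input_signature)  # expected input name(s), dtype, and shape
print(sig.structured_outputs)          # output tensor name(s)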

Running the Xception model on the device generates reasonable performance - about 0.1s to predict, ignoring all preprocessing:

import cv2
import numpy as np
import tensorflow as tf

xception = tf.keras.models.load_model("saved_xception_model/1")
image = get_some_image()  # image is a numpy.ndarray
image = image.astype("float32")
image /= 255
image = cv2.resize(image, (299, 299))
# Tensorflow predict takes ~0.1s
xception.predict(np.expand_dims(image, axis=0))  # add a batch dimension
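
For reference, the ~0.1s figure comes from a plain wall-clock measurement of the second call, after a warm-up run; a minimal sketch, continuing from the snippet above:

import time

xception.predict(np.expand_dims(image, axis=0))  # warm-up / graph initialization
start = time.perf_counter()
xception.predict(np.expand_dims(image, axis=0))
print(f"Predict took {time.perf_counter() - start:.3f}s")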

I start a Tensorflow Serving container with the following arguments on the Jetson Xavier.

tf_serving_cmd = "docker run --runtime=nvidia -d"
tf_serving_cmd += " --name my-container"
tf_serving_cmd += " -p=8500:8500 -p=8501:8501"
tf_serving_cmd += " --mount=type=bind,source=/home/xception_model,target=/models/xception_model"
tf_serving_cmd += " --mount=type=bind,source=/home/model_config.pb,target=/models/model_config.pb"
tf_serving_cmd += " --mount=type=bind,source=/home/batching_config.pb,target=/models/batching_config.pb"

# Self built TF serving image for Jetson Xavier, ARMv8.
tf_serving_cmd += " ${MY_ORG}/serving" 
# I have tried 0.5 with no performance difference. 

# TF-Serving does not complain it wants more memory in either case.
tf_serving_cmd += " --per_process_gpu_memory_fraction:0.25"
tf_serving_cmd += " --model_config_file=/models/model_config.pb"
tf_serving_cmd += " --flush_filesystem_caches=true"
tf_serving_cmd += " --enable_model_warmup=true"
tf_serving_cmd += " --enable_batching=true"
tf_serving_cmd += " --batching_parameters_file=/models/batching_config.pb"
Arnold1 commented 4 years ago

@patha454 a few questions:

patha454 commented 4 years ago

@patha454 a few questions:

* what does your batching_parameters_file look like?

* do you use http or grpc?

* how many requests per second do you send to TensorFlow Serving?

* how many grpc connections, how many threads per connection?

Thanks Arnold.

I've experimented with other values, although not very rigorously. In any case, other models with a similar variable count, such as a custom full YOLO, do not suffer from this bug.

gowthamkpr commented 4 years ago

Thank you for reporting the problem. Can you take a look at the performance guide, and if that does not address your problem, profile your inference requests with Tensorboard?

patha454 commented 4 years ago

Thank you for reporting the problem. Can you take a look at the performance guide, and if that does not address your problem, profile your inference requests with Tensorboard?

Thanks @gowthamkpr .

I can confirm the performance guide did not address my issue.

I've created Tensorboard Traces using the x64 laptop (no GPU, tensorflow/serving:latest), and the x64 server (RTX2080, tensorflow/serving:latest-gpu), so both are now TF Serving 2.1 installs.

My profile traces are hosted on Github Gists:

I'm not familiar with Tensorboard or with interpreting traces. As far as I can tell, the trace for the CPU-only laptop appears to show only a single core in use, despite eight being available. There appears to be no particular bottleneck; everything is simply "a bit" slow and it adds up.

The trace from the GPU server appears empty, and doesn't profile the GPU itself. I'm not sure if this is related to my issue, or if I'm misusing Tensorboard... Probably the latter.

Despite the massive hardware difference between the CPU-only laptop and the RTX server, both predict requests took ~2.4 seconds client-side (based on timing the PredictionServiceStub.Predict method call). Each trace comes from the second predict call to the model, after running one predict to initialize and warm up the servers.

I reiterate that both these machines run the model much faster in vanilla Tensorflow: ~0.07s on the RTX server, and 1s on the CPU laptop. Other models with similar layer and variable counts run at a latency indistinguishable from vanilla Tensorflow, so I don't think Tensorflow Serving's, well, serving overhead is the issue here. The CPU trace does suggest the actual computation took only 250ms, though. I suspect at this point someone more familiar with the traces and TF Core can make more sense of the data than I can.
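
For context, the client-side latency above is just a wall-clock measurement around the stub call. A minimal sketch of that measurement (the model name is assumed from the /models/xception_model mount target used earlier, and the input tensor name "input_1" is an assumption that should be checked against the exported signature):

import time
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "xception_model"           # assumed model name
request.model_spec.signature_name = "serving_default"
image = np.random.rand(1, 299, 299, 3).astype(np.float32)        # stand-in for a real preprocessed frame
request.inputs["input_1"].CopyFrom(tf.make_tensor_proto(image))  # input name is an assumption

start = time.perf_counter()
stub.Predict(request, timeout=30.0)
print(f"Predict round trip: {time.perf_counter() - start:.3f}s")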

I'm about to head out of office for the weekend, so I'll probably get back to you early next week in the unlikely case you respond over the weekend.

Thanks for your support so far, and have a good weekend!

Arnold1 commented 4 years ago

@patha454 @gowthamkpr I also want to profile it in my environment, is there a way to profile rest port (8501) with TensorBoard?

@patha454 could your problem be related to https://github.com/tensorflow/tensorboard/issues/3256 ?

patha454 commented 4 years ago

@patha454 @gowthamkpr I also want to profile it in my environment, is there a way to profile rest port (8501) with TensorBoard?

@patha454 could your problem be related to tensorflow/tensorboard#3256 ?

As far as I understand Tensorboard (which is "not much"), Tensorboard profiles all requests. The GRPC port is just used by Tensorboard to request diagnostic data via undocumented(?) APIs.

From that I'd assume that you can just expose both the GRPC port (for Tensorboard's diagnostics) and the REST port (for actually making queries) with docker run -p 8500:8500 -p 8501:8501 ... and then make REST requests to profile.
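
To make that concrete, a REST prediction request against the standard /v1/models/<name>:predict endpoint might look like the sketch below (the model name is again assumed from the mount target used earlier):

import json
import time
import numpy as np
import requests

image = np.random.rand(299, 299, 3).astype("float32")  # stand-in for a real preprocessed frame
payload = {"instances": [image.tolist()]}

start = time.perf_counter()
resp = requests.post("http://localhost:8501/v1/models/xception_model:predict", data=json.dumps(payload))
print(resp.status_code, f"{time.perf_counter() - start:.3f}s")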

As for the TB issue with the near-empty GPU trace, possibly. I don't recall seeing anything about libcupti, but I wasn't looking for it either. I'll have a look on Monday when I'm next in office. I'm hoping the CPU traces will be enough for now, seeing as all the CPU systems have the same issue.

All the best!

patha454 commented 4 years ago

Update:

I've profiled a known-good model with a similar layer and variable count - a custom-trained full YOLO - on the x64 laptop. The trace doesn't appear that different from the Xception trace on the laptop: a single core in use, ~150ms to compute.

The x64 laptop YOLO trace is hosted on GitHub Gists.

This model does, however, run in about the same time on TF Serving as in vanilla Tensorflow on the same machine: about one second.

@Arnold1 I can now confirm that the GPU traces not showing GPU activity is related to tensorflow/tensorboard#3256. I am seeing the libcupti errors. My attempts at the discussed workaround have failed. I can still recreate this issue accurately on CPU versions, so I believe this secondary issue is not particularly significant to this particular ticket.

patha454 commented 4 years ago

Update:

I've attempted to write a minimal, isolated script to reproduce the issue, and failed. The script which fails to reproduce the issue is on Github Gists.

I've noticed that the first call to TF and TFS in my isolated script has the expected very high latency, since the model has to be initialized, and then subsequent calls drop down to reasonable and consistent times. In our production code, which includes the bug, the warmup cost is either not there or, perhaps better said, "always there": the first request is just as slow as later requests. Could this suggest TFS is unloading the model immediately after use, or re-initializing it every time? Unfortunately I'm not at liberty to share the live code, sorry...
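
One cheap way to test the unloading/re-initialization theory would be to poll TF Serving's model status endpoint between requests; if the reported version state ever leaves AVAILABLE, the model is being reloaded. A minimal sketch (host and model name assumed, as before):

import requests

status = requests.get("http://localhost:8501/v1/models/xception_model").json()
print(status)  # e.g. {"model_version_status": [{"version": "1", "state": "AVAILABLE", ...}]}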

Edit: Update:

When I previously reproduced this issue on x64 machines - the laptop and server - I ran the TFS container on the x64 devices, but ran our client code on the Jetson Xavier and submitted requests over the network.

My failures to reproduce the issue with the previously mentioned minimal script have been compounded by my new failure to reproduce when running my client code on the same x64 device that runs the container. From this, I now suspect the issue is related to either my build of Tensorflow Serving, or my build of the Tensorflow Serving APIs on the ARM-based Xavier, with the latter being more likely.

I will attempt a rebuild of these binaries and let you know how it goes. Is it possible a version mismatch between the TFS APIs and TFS (or their dependencies) could cause the observed performance drop?

nrobeR commented 4 years ago

I have a few questions on the problem to clarify my understanding.

patha454 commented 4 years ago

Hi @nrobeR,

I hope that helps!

only-yao commented 4 years ago

Is this problem solved now? I have the same problem.

singhniraj08 commented 1 year ago

@patha454,

Going through the whole thread, I see that the Xception model's performance is poor when making predictions on GPU, and that on CPU it uses only a single core despite eight being available.

For slow model performance on GPU, the issue might be that the overhead of invoking GPU kernels and copying data to and from the GPU is very high. For models with very few parameters it may not be worth using the GPU, since the clock frequency of CPU cores is much higher. Increasing the batch size will improve performance on the GPU; try increasing it to a larger number (e.g. 500 or 1000).

For the single-core usage on CPU, you can experiment with the tensorflow_intra_op_parallelism, tensorflow_inter_op_parallelism, and rest_api_num_threads command-line flags for better performance, for example as shown below.
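
For illustration only, in the same style as the docker command earlier in this thread, those flags could be appended as follows (the values are placeholders and should be tuned to the machine, e.g. the number of physical cores):

tf_serving_cmd += " --tensorflow_intra_op_parallelism=8"
tf_serving_cmd += " --tensorflow_inter_op_parallelism=8"
tf_serving_cmd += " --rest_api_num_threads=16"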

Please try the above steps and let us know if that helps. Thank you!

github-actions[bot] commented 1 year ago

This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 1 year ago

This issue was closed due to lack of activity after being marked stale for the past 7 days.

google-ml-butler[bot] commented 1 year ago

Are you satisfied with the resolution of your issue?