tensorflow / serving

A flexible, high-performance serving system for machine learning models
https://www.tensorflow.org/serving
Apache License 2.0

Tensorflow Serving runs stock Xception models very slowly. #1564

Closed · patha454 closed this issue 1 year ago

patha454 commented 4 years ago

Bug Report

System information

Describe the problem

We've been attempting to deploy a machine learning solution with Tensorflow Serving on an embedded device (Jetson Xavier [ARMv8]).

One model used by the solution is a stock Xception network from tf.keras.applications. Running the Xception model on the device in vanilla Tensorflow gives reasonable performance - about 0.1s per prediction, ignoring all preprocessing. We get similar results running the Xception model on an RTX2080 server and an x86 laptop, with the model running at reasonable speeds in Tensorflow - between 1 second per CPU prediction and 0.01s per GPU prediction, depending on the hardware capabilities.

However, once the model is running in a Tensorflow Serving GPU container, via Nvidia-Docker, the model is much slower - about 3s to predict.

I've been trying to isolate the cause of the poor performance; so far I've tested:

I would expect TF Serving to provide the same prediction time as TF (more or less, allowing for GRPC encoding and decoding), and it does for other models I'm running. None of my efforts have gotten Xception up to the ~0.1s performance I would expect on the embedded device. In light of our tests on the server and laptop with pre-built binaries, we're now confident this is a problem with Tensorflow Serving itself, and not with our infrastructure or build.

Another team member opened a ticket on Stack Overflow before we were certain the problem was in Tensorflow Serving. This bug report is based on the Stack Overflow ticket: [https://stackoverflow.com/questions/60482185/why-would-tensorflow-serving-run-a-model-slower-than-tensorflow-on-a-keras-stock]

Exact Steps to Reproduce

The stock Xception model was generated from Tensorflow, with:

import tensorflow as tf

xception = tf.keras.applications.Xception(include_top=False, input_shape=(299, 299, 3), pooling=None)
xception.save("./saved_xception_model/1", save_format="tf")
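
If it helps, the serving signature that TF Serving will expose for this export - and therefore the input tensor name a client must use - can be inspected from Python. A minimal sketch, assuming the default "serving_default" signature produced by the export above:

import tensorflow as tf

loaded = tf.saved_model.load("./saved_xception_model/1")
sig = loaded.signatures["serving_default"]
print(sig.structured_input_signature)  # expected input name(s), dtype, and shape
print(sig.structured_outputs)          # output tensor name(s)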

Running the Xception model on the device generates reasonable performance - about 0.1s to predict, ignoring all preprocessing:

import cv2
import numpy as np
import tensorflow as tf

xception = tf.keras.models.load_model("saved_xception_model/1")
image = get_some_image()  # image is a numpy.ndarray
image = image.astype("float32")
image /= 255
image = cv2.resize(image, (299, 299))
# Tensorflow predict takes ~0.1s
xception.predict(np.expand_dims(image, axis=0))  # add a batch dimension
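
For reference, the ~0.1s figure comes from a plain wall-clock measurement of the second call, after a warm-up run; a minimal sketch, continuing from the snippet above:

import time

xception.predict(np.expand_dims(image, axis=0))  # warm-up / graph initialization
start = time.perf_counter()
xception.predict(np.expand_dims(image, axis=0))
print(f"Predict took {time.perf_counter() - start:.3f}s")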

I start a Tensorflow Serving container with the following arguments on the Jetson Xavier.

tf_serving_cmd = "docker run --runtime=nvidia -d"
tf_serving_cmd += " --name my-container"
tf_serving_cmd += " -p=8500:8500 -p=8501:8501"
tf_serving_cmd += " --mount=type=bind,source=/home/xception_model,target=/models/xception_model"
tf_serving_cmd += " --mount=type=bind,source=/home/model_config.pb,target=/models/model_config.pb"
tf_serving_cmd += " --mount=type=bind,source=/home/batching_config.pb,target=/models/batching_config.pb"

# Self built TF serving image for Jetson Xavier, ARMv8.
tf_serving_cmd += " ${MY_ORG}/serving" 
# I have tried 0.5 with no performance difference. 

# TF-Serving does not complain it wants more memory in either case.
tf_serving_cmd += " --per_process_gpu_memory_fraction:0.25"
tf_serving_cmd += " --model_config_file=/models/model_config.pb"
tf_serving_cmd += " --flush_filesystem_caches=true"
tf_serving_cmd += " --enable_model_warmup=true"
tf_serving_cmd += " --enable_batching=true"
tf_serving_cmd += " --batching_parameters_file=/models/batching_config.pb"
Arnold1 commented 4 years ago

@patha454 a few questions:

patha454 commented 4 years ago

@patha454 a few questions:

* what does your batching_parameters_file look like?

* do you use http or grpc?

* how many requests per second do you send to TensorFlow Serving?

* how many grpc connections, how many threads per connection?

Thanks Arnold.

I've experimented with other values, although not very rigorously. In any case, other models with a similar variable count, such as a custom full YOLO, do not suffer from this bug.

gowthamkpr commented 4 years ago

Thank you for reporting the problem. Can you take a look at the performance guide, and if that does not address your problem, profile your inference requests with Tensorboard?

patha454 commented 4 years ago

Thank you for reporting the problem. Can you take a look at the performance guide, and if that does not address your problem, profile your inference requests with Tensorboard?

Thanks @gowthamkpr .

I can confirm the performance guide did not address my issue.

I've created Tensorboard Traces using the x64 laptop (no GPU, tensorflow/serving:latest), and the x64 server (RTX2080, tensorflow/serving:latest-gpu), so both are now TF Serving 2.1 installs.

My profile traces are hosted on Github Gists:

I'm not familiar with Tensorboard or with interpreting traces. As far as I can tell, the trace for the CPU-only laptop appears to show only a single core in use, despite eight being available. There appears to be no particular bottleneck; everything is simply "a bit" slow and it adds up.

The trace from the GPU server appears empty, and doesn't profile the GPU itself. I'm not sure if this is related to my issue, or if I'm misusing Tensorboard... Probably the latter.

Despite the massive hardware difference between the CPU-only laptop and the RTX server, both predict requests took ~2.4 seconds client-side (based on timing the PredictionServiceStub.Predict method call). Each trace comes from the second predict call to the model, after running one predict to initialize and warm up the servers.

I reiterate that both these machines run the model much faster in vanilla Tensorflow: ~0.07s on the RTX server, and 1s on the CPU laptop. Other models with similar layer and variable counts run at a latency indistinguishable from vanilla Tensorflow, so I don't think Tensorflow Serving's, well, serving overhead is the issue here. The CPU trace does suggest the actual computation took only 250ms, though. I suspect at this point someone more familiar with the traces and TF Core can make more sense of the data than I can.
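
For context, the client-side latency above is just a wall-clock measurement around the stub call. A minimal sketch of that measurement (the model name is assumed from the /models/xception_model mount target used earlier, and the input tensor name "input_1" is an assumption that should be checked against the exported signature):

import time
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "xception_model"           # assumed model name
request.model_spec.signature_name = "serving_default"
image = np.random.rand(1, 299, 299, 3).astype(np.float32)        # stand-in for a real preprocessed frame
request.inputs["input_1"].CopyFrom(tf.make_tensor_proto(image))  # input name is an assumption

start = time.perf_counter()
stub.Predict(request, timeout=30.0)
print(f"Predict round trip: {time.perf_counter() - start:.3f}s")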

I'm about to head out of office for the weekend, so I'll probably get back to you early next week in the unlikely case you respond over the weekend.

Thanks for your support so far, and have a good weekend!

Arnold1 commented 4 years ago

@patha454 @gowthamkpr I also want to profile it in my environment, is there a way to profile rest port (8501) with TensorBoard?

@patha454 could your problem be related to https://github.com/tensorflow/tensorboard/issues/3256 ?

patha454 commented 4 years ago

@patha454 @gowthamkpr I also want to profile it in my environment, is there a way to profile rest port (8501) with TensorBoard?

@patha454 could your problem be related to tensorflow/tensorboard#3256 ?

As far as I understand Tensorboard (which is "not much"), Tensorboard profiles all requests. The GRPC port is just used by Tensorboard to request diagnostic data via undocumented(?) APIs.

From that I'd assume that you can just expose both the GRPC port (for Tensorboard's diagnostics) and the REST port (for actually making queries) with docker run -p 8500:8500 -p 8501:8501 ... and then make REST requests to profile.
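
To make that concrete, a REST prediction request against the standard /v1/models/<name>:predict endpoint might look like the sketch below (the model name is again assumed from the mount target used earlier):

import json
import time
import numpy as np
import requests

image = np.random.rand(299, 299, 3).astype("float32")  # stand-in for a real preprocessed frame
payload = {"instances": [image.tolist()]}

start = time.perf_counter()
resp = requests.post("http://localhost:8501/v1/models/xception_model:predict", data=json.dumps(payload))
print(resp.status_code, f"{time.perf_counter() - start:.3f}s")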

As for the TB issue with the near-empty GPU trace, possibly. I don't recall seeing anything about libcupti, but I wasn't looking for it either. I'll have a look on Monday when I'm next in office. I'm hoping the CPU traces will be enough for now, seeing as all the CPU systems have the same issue.

All the best!

patha454 commented 4 years ago

Update:

I've profiled a known-good model with a similar layer and variable count - a custom-trained full YOLO - on the x64 laptop. The trace doesn't appear that different from the Xception trace on the laptop: a single core in use, ~150ms to compute.

The x64 laptop YOLO trace is hosted on GitHub Gists.

This model does, however, run in about the same time on TF Serving as in vanilla Tensorflow on the same machine: about one second.

@Arnold1 I can now confirm that the GPU traces not showing GPU activity is related to tensorflow/tensorboard#3256. I am seeing the libcupti errors. My attempts at the discussed workaround have failed. I can still recreate this issue accurately on CPU versions, so I believe this secondary issue is not particularly significant to this particular ticket.

patha454 commented 4 years ago

Update:

I've attempted to write a minimal, isolated script to reproduce the issue, and failed. The script which fails to reproduce the issue is on Github Gists.

I've noticed that the first call to TF and TFS in my isolated script has the expected very high latency, since the model has to be initialized, and then subsequent calls drop down to reasonable and consistent times. In our production code, which includes the bug, the warmup cost is either not there or, perhaps better said, "always there": the first request is just as slow as later requests. Could this suggest TFS is unloading the model immediately after use, or re-initializing it every time? Unfortunately I'm not at liberty to share the live code, sorry...
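
One cheap way to test the unloading/re-initialization theory would be to poll TF Serving's model status endpoint between requests; if the reported version state ever leaves AVAILABLE, the model is being reloaded. A minimal sketch (host and model name assumed, as before):

import requests

status = requests.get("http://localhost:8501/v1/models/xception_model").json()
print(status)  # e.g. {"model_version_status": [{"version": "1", "state": "AVAILABLE", ...}]}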

Edit: Update:

When I previously reproduced this issue on x64 machines - the laptop and server - I ran the TFS container on the x64 devices, but ran our client code on the Jetson Xavier and submitted requests over the network.

My failures to reproduce the issue with the previously mentioned minimal script have been compounded by my new failure to reproduce when running my client code on the same x64 device that runs the container. From this, I now suspect the issue is related to either my build of Tensorflow Serving, or my build of the Tensorflow Serving APIs on the ARM-based Xavier, with the latter being more likely.

I will attempt a rebuild of these binaries and let you know how it goes. Is it possible a version mismatch between the TFS APIs and TFS (or their dependencies) could cause the observed performance drop?

nrobeR commented 4 years ago

I have a few questions on the problem to clarify my understanding.

patha454 commented 4 years ago

Hi @nrobeR,

I hope that helps!

only-yao commented 4 years ago

Is this problem solved now? I have the same problem.

singhniraj08 commented 1 year ago

@patha454,

Going through the whole thread, I see that the Xception model's performance is poor when making predictions on GPU, and that on CPU it uses only a single core despite eight being available.

For slow model performance on GPU, the issue might be that the overhead of invoking GPU kernels and copying data to and from the GPU is very high. For models with very few parameters it may not be worth using the GPU, since the clock frequency of CPU cores is much higher. Increasing the batch size will improve performance on the GPU; try increasing it to a larger number (e.g. 500 or 1000).

For the single-core usage on CPU, you can experiment with the tensorflow_intra_op_parallelism, tensorflow_inter_op_parallelism, and rest_api_num_threads command-line flags for better performance, for example as shown below.
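
For illustration only, in the same style as the docker command earlier in this thread, those flags could be appended as follows (the values are placeholders and should be tuned to the machine, e.g. the number of physical cores):

tf_serving_cmd += " --tensorflow_intra_op_parallelism=8"
tf_serving_cmd += " --tensorflow_inter_op_parallelism=8"
tf_serving_cmd += " --rest_api_num_threads=16"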

Please try the above steps and let us know if that helps. Thank you!

github-actions[bot] commented 1 year ago

This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 1 year ago

This issue was closed due to lack of activity after being marked stale for the past 7 days.

google-ml-butler[bot] commented 1 year ago

Are you satisfied with the resolution of your issue?