tensorflow / serving

A flexible, high-performance serving system for machine learning models
https://www.tensorflow.org/serving
Apache License 2.0

Batching does not work for model with single dimension tf.example serialized input #1529

Closed raimondasl closed 4 years ago

raimondasl commented 4 years ago

I have a model whose input is a single-dimension serialized tf.Example. TF Serving does not seem to perform batching with this model and input. I have the following batching config:

max_batch_size { value: 32 }
num_batch_threads { value: 16 }
batch_timeout_micros { value: 100000 }
pad_variable_length_inputs: true
allowed_batch_sizes : 2
allowed_batch_sizes : 4
allowed_batch_sizes : 8
allowed_batch_sizes : 32

I have tried it with and without "pad_variable_length_inputs: true". I have tried batch_timeout_micros at various values, e.g. batch_timeout_micros { value: 1000 }. The value above is set very high to see whether that would make the server batch the incoming requests.
On the TF Serving server side, I run

tensorflow_model_server --port=8500 --model_name=mymodel --model_base_path=/serving --enable_batching=true --batching_parameters_file=/serving/batching_parameters.txt

inside

docker run -it -p 8500:8500 -p 8501:8501 tensorflow/serving:latest-devel

The inference runs on CPU on a machine with 16 CPU cores. I run up to 16 clients on a different machine issuing requests via gRPC against this server. The inference succeeds - I get correct inference results in all clients. However, the resulting inference time is the same as without batching enabled, which seems to indicate that TF Serving is not doing batching (it just queues the requests and runs inference with one request per batch).
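For context, my clients look roughly like the sketch below (Python; the input tensor name "examples", the signature name, and the feature key are placeholders for my actual model, so treat them as assumptions). The point is just that each request carries a batch dimension of 1 holding a single serialized tf.Example string.

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Build one serialized tf.Example (the feature key "text" is a placeholder).
example = tf.train.Example(features=tf.train.Features(feature={
    "text": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"some input"])),
}))
serialized = example.SerializeToString()

# Connect to the model server over gRPC.
channel = grpc.insecure_channel("serving-host:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# One request = one serialized example, i.e. an input tensor of shape [1].
request = predict_pb2.PredictRequest()
request.model_spec.name = "mymodel"
request.model_spec.signature_name = "serving_default"
request.inputs["examples"].CopyFrom(
    tf.make_tensor_proto([serialized], dtype=tf.string, shape=[1]))

response = stub.Predict(request, 10.0)  # 10-second RPC deadline
print(response.outputs)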

TF Serving also does not provide functionality to log or inspect whether it is doing batching and what the individual batch sizes are. I have tried enabling multiple logging levels via TF_CPP_MIN_VLOG_LEVEL, and there are no tensorflow_serving log lines that correspond to inference requests or batching. This is very disappointing.

I have the same model with the same inputs running on the same machine using the Nvidia TRT IS server with the batching timeout set to 1 ms, and there batching produces a significant speedup. Plus, TRT IS provides easily accessible information about batching, which confirms that batching is occurring and that inference is done with batches of more than one element. So the issue is not that the model is unbatchable, and it is not that batching won't help performance. The issue seems to be in how TF Serving handles batching in this case.

So questions and requests:

  1. Is batching supposed to work with models that have a single-dimension serialized tf.Example input?
  2. Can TF Serving provide a way to access batching logs, or at least information about how many requests were batched together?

Thank you

gowthamkpr commented 4 years ago

@raimondasl This may be a network I/O problem; you can use dstat to monitor your network interface.

You can also refer to the question here.

raimondasl commented 4 years ago

I have instrumented tensorflow_serving/batching/batching_session.cc and I see that batching is happening. At least when BatchingSession::ProcessBatch() is executed, the inputs are merged in MergeInputTensors(), the merged inputs are passed to wrapped->Run(), and the results are split in SplitOutputTensors().

So then we are back to the performance problem: why is batched processing with tensorflow_serving slower than non-batched? With the Nvidia TRT IS server, using similar batching settings, the batched request turnaround time (latency) is much lower than non-batched. With TF Serving, batched is slower. I'm running tests with the following batching parameters:

max_batch_size { value: 16 }
num_batch_threads { value: 4 }
batch_timeout_micros { value: 1000 }
pad_variable_length_inputs: false
allowed_batch_sizes : 2
allowed_batch_sizes : 4
allowed_batch_sizes : 8
allowed_batch_sizes : 16

and with 4 clients accessing a single server.
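For measurement, each client simply times its own requests. A simplified single-machine sketch of what the client side does is below (in reality the clients are separate processes on another machine; predict_once stands for any zero-argument callable that issues one Predict RPC, e.g. the request from the earlier sketch).

import time
from concurrent.futures import ThreadPoolExecutor

def run_benchmark(predict_once, num_clients=4, requests_per_client=500):
    """Run num_clients concurrent request streams and report latency stats."""

    def run_client(_):
        latencies = []
        for _ in range(requests_per_client):
            start = time.perf_counter()
            predict_once()  # one gRPC Predict call
            latencies.append(time.perf_counter() - start)
        return latencies

    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        per_client = list(pool.map(run_client, range(num_clients)))

    all_latencies = sorted(l for client in per_client for l in client)
    mean = sum(all_latencies) / len(all_latencies)
    p99 = all_latencies[int(0.99 * len(all_latencies))]
    print(f"mean latency: {mean * 1000:.1f} ms, p99: {p99 * 1000:.1f} ms")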

Are you saying that the batched slowdown might be due to network io?

If you prefer, you can close this case. I'll open another one if/when I figure out what's happening with batching performance.

Should I open a feature request for TF Serving users to get access to batching information (number of requests processed, number of batches processed, compute time and queueing time) via an HTTP status request?
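As an aside, I may also experiment with TF Serving's Prometheus metrics endpoint (the --monitoring_config_file flag), although I have not yet checked whether any batching-level counters are exported there, so treat this as a guess at where such information could surface. The monitoring config would look roughly like this, with the REST port enabled via --rest_api_port=8501 so the metrics show up under /monitoring/prometheus/metrics:

prometheus_config {
  enable: true
  path: "/monitoring/prometheus/metrics"
}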

Thanks

christisg commented 4 years ago

Hi @raimondasl,

Batching allows you to achieve better throughput, but it may come at the cost of increased average and/or tail latency, because the server waits to accumulate a batch. There are plenty of tuning options for achieving the right balance between throughput and latency for your environment and QPS characteristics. I'd strongly recommend looking at github.com/tensorflow/serving/blob/master/tensorflow_serving/batching/README.md and trying the options mentioned there.

raimondasl commented 4 years ago

@christisg : actually batching can improve latency significantly. Regarding github.com/tensorflow/serving/blob/master/tensorflow_serving/batching/README.md, it has misleading advice.

As I have said above, I had to instrument tensorflow_serving/batching/batching_session.cc to get info on what batching does. I think the TF Serving project should provide logging/instrumentation similar to what I put in. Without it, it's like flying blind.

Here are the things I discovered (which might be particular to my model/client situation and may not apply to everyone):

  1. If "allowed_batch_sizes : x" are specified, TF Serving always uses the next available batch size and "pads" the batch with empty tensors to fill it. This hits performance, especially in cases where there's a single request and larger batch size is used (2 or more).

  2. num_batch_threads - github.com/tensorflow/serving/blob/master/tensorflow_serving/batching/README.md suggests "Set num_batch_threads to the number of CPU cores". This hurts performance a lot, at least in my case. TF likes to use multiple CPUs for inference. If num_batch_threads is set to the number of CPU cores and the number of simultaneous requests is similar to the number of threads, performance is hit twice: first, there is no batching, since every request gets its own thread; second, TF cannot use multiple CPUs during inference. To see batching and improved request latency, I have to set this to 1 (or close to 1). That improves latency by about 1.15x to 1.4x compared to non-batching in my case.
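To make point 1 concrete, here is a toy illustration of the padding arithmetic I observed (this is not TF Serving code, just the effect of rounding up to the next allowed batch size):

import bisect

ALLOWED_BATCH_SIZES = [2, 4, 8, 16]

def padded_batch_size(num_requests):
    # Next allowed batch size >= num_requests, mirroring the padding I saw.
    i = bisect.bisect_left(ALLOWED_BATCH_SIZES, num_requests)
    return ALLOWED_BATCH_SIZES[i]

for n in (1, 3, 5, 9):
    padded = padded_batch_size(n)
    waste = (padded - n) / padded
    print(f"{n} real request(s) -> batch of {padded} ({waste:.0%} of the batch is padding)")

With a single real request, half of a size-2 batch is empty padding; a batch of 3 real requests padded to 4 wastes a quarter of the work.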

There might be something funky happening with batch scheduling. If I have num_batch_threads set to 1 and 4 concurrent request streams, I see inferences alternating between batch sizes 1 and 3 (which is padded to 4). Even if I set batch_timeout_micros { value: 100000 }, which is far higher than the interval at which concurrent requests arrive, the inferences still run at sizes 1 / 3 (4) for half of the test and then suddenly run at size 4 for the remainder. I am not going to debug this further, though.

So you can close the case, although I think that making the actual batch size observable through some logging flag would be very useful for everyone.

Thanks

christisg commented 4 years ago

Thank you for your feedback. I think adding more logging is a good suggestion. Please feel free to file a feature request or contribute. As for your comment about the general recommendations in tensorflow_serving/batching/README.md, it would be great to better understand which part was misleading.

  1. The guideline states that allowed_batch_sizes is optional and should be set only if, for system-architecture reasons, you need to constrain the set of possible batch sizes. It might indeed make performance worse for some configurations.
  2. The recommendation "Set num_batch_threads to the number of CPU cores" is mentioned as one possible approach, and is definitely not the answer for all environments. Have you tried other approaches given in the general guidance? Have you tried tuning max_enqueued_batches?

To summarize, the key here is experimentation and tuning. But without having a way to reproduce your use-case, I can't make any more specific recommendations.

gowthamkpr commented 4 years ago

@raimondasl Please reply to above comment so we can discuss further. Thanks!

raimondasl commented 4 years ago

@christisg and @gowthamkpr - I am sorry for the delayed reply. I was busy with projects as well as running some more experiments with TF Serving and batching.

I have to check with my organization about contributing the logging code. I'll get back when/if I know.

I have encountered an issue that likely contributed to my previous results with TF Serving. I am running my experiments on Azure nodes running Debian. Recently Azure switched the nodes' underlying CPUs from Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz to Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz. (They may have changed other hardware, but that is the one change that is visible.) After the switch, I noticed that TF Serving uses many more cores, in both batching and non-batching scenarios. Unfortunately I no longer have access to the old Azure nodes, so I cannot rerun the full set of experiments to see exactly what changed, but this has affected inference performance a lot. In short, I think the previous CPU/node/hardware ran too few TF Serving threads, either at the TF Serving level or at the TF level, for whatever reason (bad thread scheduling?). This contributed to some of my earlier conclusions about batching that are no longer correct.

After rerunning experiments on the new node/CPU configuration, it seems that the best setup for batching is pretty minimal:

batch_timeout_micros { value: 10000 }

The latency (running 8 clients accessing a single TF Serving server via gRPC) is the best with this setup, at around 0.8x compared to no batching. It also uses fewer compute cores, with CPU load at ~13.3 vs. 15.0 for no batching. (Just FYI, Nvidia TRT IS with batching shows the same latency as TF Serving with batching, but lower CPU load, at 9.8, for the same client count/workload.)

Setting num_batch_threads to the number of CPU cores increases latency 1.20x compared to the minimal (optimal?) batching setup mentioned above.

With all this said, I would suggest changing tensorflow_serving/batching/README.md by either removing the "num_batch_threads equal to the number of CPU cores" part or making it more neutral. Maybe it should say something like "do not set num_batch_threads; if needed, experiment with num_batch_threads settings from 1 to the number of CPU cores".

Thanks for mentioning max_enqueued_batches. I don't think this is an issue in my current experiments, but I will keep it in mind for the future.

I am not sure whether this case is sufficiently interesting to the community. If not, I'd be happy to discuss anything further via other channels.

Thank you all for responses and for the code base that was pretty easy to read, understand and change.

raimondasl commented 4 years ago

I ran another experiment with 16 clients accessing a single TF Serving server on a 16-core CPU machine. Setting num_batch_threads to 4 outperformed not setting num_batch_threads at all. Unfortunately, the result is the opposite for 8 clients. So, at least in my use case, there is no single best batching configuration across different numbers of clients.

christisg commented 4 years ago

@raimondasl, I don't have full visibility into your setup, but assuming the aggregated traffic coming from 8 clients is half the traffic coming from 16 clients, it's possible that the overhead of more threads outweighs the benefits of parallelism. Also, with lower QPS there is a greater chance of waiting longer before executing a batch (depending on your batch_timeout_micros setting).

Thanks for your suggestion for documentation improvement. We'll look into updating it to avoid confusion.

vqbang commented 4 years ago

@raimondasl Excuse me, but have you ever tried sending a single request with n (e.g. 4) examples and monitoring whether it uses batching? I think the model doesn't do batching in that case. I hope you can give me some advice. I've been trying to figure out what's going on for a week and am also frustrated about the logging of TensorFlow Serving :(

leo-XUKANG commented 4 years ago

@raimondasl Thanks for your work, it helped me a lot.

rmothukuru commented 4 years ago

Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!