vbezgachev / tf_serving_example

Deployment of TensorFlow models into production with TensorFlow Serving, Docker, Kubernetes and Microsoft Azure

Batching in tensorflow serving #4

Open R-Miner opened 6 years ago

R-Miner commented 6 years ago

Hello Vetal1977, I am wondering whether you have attempted batching with TensorFlow Serving. It would be a great help if you could guide me on how to do it. I tried using https://stackoverflow.com/questions/42519010/how-to-do-batching-in-tensorflow-serving as a reference, but it is not helping in any way.

Any help is appreciated.

vbezgachev commented 6 years ago

Hello @R-Miner, sorry for the delayed reply. That approach worked for me too. First I read the files and loaded the images into a list:

    from os import listdir
    from os.path import isfile, join

    path = 'performance'
    filenames = [(path + '/' + f) for f in listdir(path) if isfile(join(path, f))]
    files = []
    imagedata = []
    for filename in filenames:
        f = open(filename, 'rb')
        files.append(f)

        # Read the raw image bytes that will go into the request
        data = f.read()
        imagedata.append(data)

and then called the prediction:

    print('In batch mode')
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'inception'
    request.model_spec.signature_name = 'predict_images'

    request.inputs['images'].CopyFrom(
        tf.contrib.util.make_tensor_proto(imagedata, shape=[len(imagedata)]))
    result = stub.Predict(request, 10.0)  # 10 secs timeout

What kind of error did you get?

R-Miner commented 6 years ago

Thanks Vetal for your response. Glad to hear that it worked for you.

Mine is a custom model: a 'saved_model.pb' converted from a Keras .h5 weight file.

The model is exported in such a way that it accepts only one image with shape None, so it does not allow me to pass len(imagedata) as the shape to make_tensor_proto.

Could you explain what changes you made when exporting the model in order to do batching?

Thanks, R

vbezgachev commented 6 years ago

Could you give me a hint as to how you exported the model?

R-Miner commented 6 years ago

github.com/tensorflow/serving/issues/878#issuecomment-389160104

The above link takes you to an issue I opened when I ran into problems converting string input to float32.

That is how I do the export. I don't use map_fn. I am still not getting the right predictions though. I am converting the string input to float because my model expects float32 input.

vbezgachev commented 6 years ago

I suspect you need to create an input placeholder and parse the input as described here: https://github.com/tensorflow/serving/blob/7c7fc37878265bda84a857aa45798a16c2617c35/tensorflow_serving/example/inception_saved_model.py#L69-L76. To my understanding, you need to call tf.parse_example() to make this work properly.
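
For reference, the input-handling part of the linked inception example looks roughly like the sketch below. preprocess_image, the 224x224 size and my_model are placeholders for your own preprocessing and your converted Keras model; note that with this kind of input the client is expected to send serialized tf.Example protos rather than raw JPEG bytes:

    import tensorflow as tf

    def preprocess_image(image_buffer):
        # Decode the JPEG bytes and convert to float32 - adapt to your model.
        image = tf.image.decode_jpeg(image_buffer, channels=3)
        image = tf.image.convert_image_dtype(image, dtype=tf.float32)
        image = tf.image.resize_images(image, [224, 224])
        return image

    # Batched input: a 1-D tensor of serialized tf.Example protos.
    serialized_tf_example = tf.placeholder(tf.string, shape=[None], name='tf_example')
    feature_configs = {'image/encoded': tf.FixedLenFeature(shape=[], dtype=tf.string)}
    tf_example = tf.parse_example(serialized_tf_example, feature_configs)
    jpegs = tf_example['image/encoded']

    # Decode and preprocess every image in the batch, then feed the whole batch
    # to the model (my_model stands for your converted Keras model).
    images = tf.map_fn(preprocess_image, jpegs, dtype=tf.float32)
    predictions = my_model(images)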

R-Miner commented 6 years ago

Thanks Vetal!!! I will try it your way. It would be a big success if I get it working; I have been at it for a long time now. I can't try it until next week though, and will update you then.

Also I am trying to see whether I can use TensorFlow Serving in the following way:

start the server ----> input tensor to model 1 ----> model 1 does the prediction ----> model 1's prediction result as input to model 2 ----> model 2 does the prediction ----> model 2's prediction result as input to model 3 ----> do some processing of the data according to the prediction result ----> return the corrected data back to the client... all in one server call.

Any thoughts on the above scenario are appreciated!!
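
For illustration only, here is a rough client-side approximation of such a chain, built on the gRPC client code from earlier in this thread (it re-uses stub and imagedata from that snippet). The model names, signature names and tensor keys are made-up placeholders, and the "all in one server call" variant would instead require wiring the models together inside one exported graph:

    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2

    # First hop: send the image bytes to model 1.
    request1 = predict_pb2.PredictRequest()
    request1.model_spec.name = 'model1'                    # placeholder name
    request1.model_spec.signature_name = 'predict_images'  # placeholder signature
    request1.inputs['images'].CopyFrom(
        tf.contrib.util.make_tensor_proto(imagedata, shape=[len(imagedata)]))
    result1 = stub.Predict(request1, 10.0)

    # Second hop: feed model 1's output tensor directly into model 2.
    request2 = predict_pb2.PredictRequest()
    request2.model_spec.name = 'model2'                    # placeholder name
    request2.model_spec.signature_name = 'predict'         # placeholder signature
    request2.inputs['input'].CopyFrom(result1.outputs['scores'])  # placeholder keys
    result2 = stub.Predict(request2, 10.0)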

vbezgachev commented 6 years ago

I have also updated the export of my own model and the client to call the server in batch mode. If it helps you, please take a look: https://github.com/Vetal1977/tf_serving_example/commit/8407dd1b692a0a690bb71de149c6b03b6648a4ea

Regarding your model chain - why do you want to do that? I needed something like this when I wanted to use a pre-trained DenseNet model with my own classifier on top of it. Is your case similar to this?

pharrellyhy commented 6 years ago

Hi @Vetal1977

What if we have 100 different users and each of them sends one request at the same time? In this case, do we have to wait until all the requests have arrived and then batch them?

Could you give some advice on how our model can handle more requests simultaneously? In my case, processing 500 requests takes about 10s, including preparing the data, converting it to a tensor_proto and running inference.

BTW, do you know how to create a TF Serving warm-up data file? When launching TF Serving, No warmup data file found at /tensorflow-serving/finger-detection-serving/a4-versions/3/assets.extra/tf_serving_warmup_requests is printed on screen, but I can't find any resources on how to create one. Thanks!

vbezgachev commented 6 years ago

Hi @pharrellyhy

The GPU is the bottleneck. Though the HTTP or gRPC server can serve simultaneous requests from multiple users, you run the inference on a single GPU. Batching is definitely a way to improve performance, since you are sticking images together and sending them to the GPU for prediction at once. In your case, I would prepare batches (of, say, 16 or 32 images) on the client side and send them to the TensorFlow server. I don't know what a warm-up data file is. Where does it come from?
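
For example, something along these lines (re-using imagedata, stub and the 'inception' model from the snippet earlier in this thread; the batch size of 32 is just an illustrative value):

    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2

    BATCH_SIZE = 32  # illustrative value - tune it for your model and GPU memory

    # Split the loaded image bytes into fixed-size batches and send each batch
    # as one PredictRequest instead of one request per image.
    for start in range(0, len(imagedata), BATCH_SIZE):
        batch = imagedata[start:start + BATCH_SIZE]

        request = predict_pb2.PredictRequest()
        request.model_spec.name = 'inception'
        request.model_spec.signature_name = 'predict_images'
        request.inputs['images'].CopyFrom(
            tf.contrib.util.make_tensor_proto(batch, shape=[len(batch)]))

        result = stub.Predict(request, 10.0)  # 10 secs timeout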

pharrellyhy commented 6 years ago

Hi @Vetal1977 Thanks. TensorFlow Serving checks for this 'warmup' file and you can see the output on the command line once it has started. If I understand correctly, it performs the warm-up step at the very beginning, so that our first request takes the same time as the following requests.

Let's stick to a single GPU for now. Since the GPU has its own computation power, running one model for inference will have the same performance as running multiple models. Am I right?

Another question: can we run multiple TF Serving instances in different processes? I tried to use Gunicorn to do that but failed. Do you have any thoughts on this? Thanks!

pharrellyhy commented 6 years ago

Hi @Vetal1977, I'm running a load test and found that the GPU utilization is only around 10%. Do you know how to increase the utilization? Thanks!

vbezgachev commented 6 years ago

Hi @pharrellyhy

I think this answer clarifies the usage of a single GPU for running multiple models simultaneously. In short, it does not make much sense, because you have to split the GPU memory between the processes, which in turn slows down training and execution. You can run tf_serving in different processes - I did that with uWSGI. Here you can find a Dockerfile and here the parameters for uWSGI.

GPU utilization - how do you measure it? I created a sample application (Angular app + Node.js API + TensorFlow Serving). If I send the images one by one without waiting for the results, relying on a callback instead, then I get a GPU utilization of around 85-90%.

Regarding the warm-up data: you are right, it is a step at the beginning to prepare the server (otherwise the first request takes much longer than the following ones). It is an asset that should be saved along with the model. I didn't find in the documentation how to specify assets.extra and which format it has. The closest answer I found is here - see assets_collection.
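
(As far as I can tell, newer TensorFlow Serving versions expect the warm-up file to be a TFRecord of PredictionLog protos written to the model version's assets.extra/tf_serving_warmup_requests. A rough sketch under that assumption, with placeholder model, signature and file names:)

    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_log_pb2

    # Build one representative request (placeholder names, same style as the
    # client snippets above).
    with open('sample.jpg', 'rb') as f:
        sample_image = f.read()

    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'inception'
    request.model_spec.signature_name = 'predict_images'
    request.inputs['images'].CopyFrom(
        tf.contrib.util.make_tensor_proto([sample_image], shape=[1]))

    # Wrap it in a PredictionLog and write it as a TFRecord into assets.extra.
    log = prediction_log_pb2.PredictionLog(
        predict_log=prediction_log_pb2.PredictLog(request=request))
    with tf.python_io.TFRecordWriter(
            'assets.extra/tf_serving_warmup_requests') as writer:
        writer.write(log.SerializeToString())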

pharrellyhy commented 6 years ago

Hi @Vetal1977, thanks for your kind reply as always.

I'm using Gunicorn as the WSGI server and it looks similar to uWSGI. I also checked your repository and then I got lost. I know uWSGI can create multiple processes, but how does it run /serving/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=gan --model_base_path=/serving/gan-export &> gan_log & to create multiple TF Serving instances? I can't see clearly how this command is being executed. I'm not quite familiar with Docker :( (I'm going to check the Docker Compose docs.)

Currently I'm using Locust for load testing. When I run the load test at ~200 RPS, the GPU utilization is around ~10% (I'm using nvidia-smi to measure it).

vbezgachev commented 6 years ago

Hi @pharrellyhy

Sorry, my bad, I didn't understand you - I'm using uWSGI on top of a Flask application.

I just did a simple experiment - I ran 2 Docker containers with TensorFlow Serving and started the servers in parallel, so I had 2 processes able to receive requests. The problem was that one of them always failed, since they were competing for the GPU. I do not see any sense in running multiple TensorFlow Serving processes on a machine with a single GPU. Furthermore, I suppose TensorFlow uses the GPU in exclusive mode for its computational graphs. If we had 2 GPUs, then we could start 2 TensorFlow servers, each using a dedicated GPU. Then we could put Nginx in front of them and load-balance the traffic to the servers; see here.

I used my own test client and just added a loop to issue requests over a longer period. I got utilization of over 75% in one-by-one mode and over 85% in batch mode. What do you do in the Locust task? Do you just issue a PredictRequest, or a bit more (load an image file, for instance)? You should also keep in mind that nvidia-smi has a monitoring cycle: it summarizes statistics over a short period of time (say 500-1000 ms), so if you have a utilization peak for 100 ms, you won't see it clearly in the statistics.

pharrellyhy commented 6 years ago

Hi @Vetal1977 Sorry if I didn't make it clear.

So, what you are saying is that if we want to serve 2 TF servers, we have to run 2 Docker containers, each running its own TF server and listening on the same port, right? For now, I have tried to run, say, 2 TF servers in the same container. I'm not quite sure whether it is working or not, since I can't make TF Serving log more useful information. How do you set up the logging for TF Serving?

vbezgachev commented 6 years ago

Hi @pharrellyhy

Not exactly - you start 2 Docker containers, but they publish different ports, say -p 9000:9000 and -p 9001:9000. In both containers you start TensorFlow Serving in the same way: tensorflow_model_server --port=9000. You can start 2 instances of TensorFlow Serving in the same container successfully, but when you issue a request from the client application, one of them crashes. I got errors such as E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED and Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR. If you want to log more, you can do that by setting the environment variable export TF_CPP_MIN_VLOG_LEVEL=3 in the running Docker container, as described here.

pharrellyhy commented 6 years ago

Thanks @Vetal1977. It's really helpful. I'm going to give it a try and see what I get. Thanks again!