openvinotoolkit / model_server

A scalable inference server for models optimized with OpenVINO™
https://docs.openvino.ai/2024/ovms_what_is_openvino_model_server.html
Apache License 2.0

Using the correct configurations to get the best performance #234

Closed · Steedance closed this issue 4 years ago

Steedance commented 4 years ago

Hi,

I'm using openvino model server to run inference on multiple models. I've read the documentation but I'm not completely sure how I should set up the config.json. The target hardware is an Intel Xeon Silver 4216 (16 cores, 32 threads).

Below is what I have been using.

```json
{
   "model_config_list":[
      {
         "config":{
            "name":"face-detection-retail-0004",
            "base_path":"/opt/ml/face-detection-retail-0004",
            "shape": "auto",
            "nireq": 8
         },
         "plugin_config": {"CPU_THROUGHPUT_STREAMS": 8, "CPU_THREADS_NUM": 32}
      },
      {
         "config":{
            "name":"age-gender-recognition-retail-0013",
            "base_path":"/opt/ml/age-gender-recognition-retail-0013",
            "batch_size": "auto",
            "nireq": 8
         },
         "plugin_config": {"CPU_THROUGHPUT_STREAMS": 8, "CPU_THREADS_NUM": 32}
      },
      {
         "config":{
            "name":"emotions-recognition-retail-0003",
            "base_path":"/opt/ml/emotions-recognition-retail-0003",
            "batch_size": "auto",
            "nireq": 8
         },
         "plugin_config": {"CPU_THROUGHPUT_STREAMS": 8, "CPU_THREADS_NUM": 32}
      },
      {
         "config":{
            "name":"head-pose-estimation-adas-0001",
            "base_path":"/opt/ml/head-pose-estimation-adas-0001",
            "batch_size": "auto",
            "nireq": 8
         },
         "plugin_config": {"CPU_THROUGHPUT_STREAMS": 8, "CPU_THREADS_NUM": 32}
      },
      {
         "config":{
            "name":"person-detection-retail-0013",
            "base_path":"/opt/ml/person-detection-retail-0013",
            "batch_size": "auto",
            "shape": "auto",
            "nireq": 8
         },
         "plugin_config": {"CPU_THROUGHPUT_STREAMS": 8, "CPU_THREADS_NUM": 32}
      },
      {
         "config":{
            "name":"person-reidentification-retail-0079",
            "base_path":"/opt/ml/person-reidentification-retail-0079",
            "batch_size": "auto",
            "nireq": 8
         },
         "plugin_config": {"CPU_THROUGHPUT_STREAMS": 8, "CPU_THREADS_NUM": 32}
      }
   ]
}
```

I'm using:

- "CPU_THROUGHPUT_STREAMS": 8 because this is what the benchmark app determined was the optimal setup.
- "CPU_THREADS_NUM": 32 because the hardware has 32 threads.
- "nireq": 8 because that is the maximum number of requests we would send per model.

I have a few questions regarding the config.json file:

I also have another question regarding sending batches to OVMS (not sure if I should make a separate issue for this). I have noticed that FPS is lower when sending larger batches. For example, I made a container with just one model loaded onto it and then sent 500 images into it (asynchronously) in batches of 1, 4, 10 and 50; batches of 4 processed the images the fastest. It is my understanding that larger batches should produce higher throughput; is this not the case when processing asynchronously?

Any help would be appreciated.

dtrawins commented 4 years ago

@Steedance my first observation is that the plugin_config should be inside the config section for each model; compare it with this example config. With the number of streams you can balance between optimal throughput and latency; it shouldn't be higher than the number of expected parallel clients. Also make sure the configured number of gRPC workers exceeds nireq; it is a global setting configurable as a CLI parameter (not in config.json). Using streams in async mode can give a similar performance gain to using bigger batches, so while you are using streams, big batches should not be needed to improve throughput.
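For reference, a single corrected entry might look roughly like the sketch below, reusing the values from the config above and only moving plugin_config inside the config section:

```json
{
   "config":{
      "name":"face-detection-retail-0004",
      "base_path":"/opt/ml/face-detection-retail-0004",
      "shape": "auto",
      "nireq": 8,
      "plugin_config": {"CPU_THROUGHPUT_STREAMS": 8, "CPU_THREADS_NUM": 32}
   }
}
```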

mzegla commented 4 years ago

Hi @Steedance,

I'm not sure if CPU_THREADS_NUM set to 32 is really helpful here. OpenVINO does not use virtual cores, and in my experience setting this parameter to the number of threads didn't improve performance.

The optimal number of streams is quite high, but if you got that number from the benchmark app then it's probably right for your setup. However, you are using multiple models, and from my observations throughput mode works better than the default only if you can provide enough data to the model. If you want to use those 6 models with no more than 8 parallel inferences on each, you could try not setting plugin_config at all (i.e. use the default settings) and set the grpc_workers parameter to 48. That should also work better if you don't have big, constant loads of incoming data.

About that grpc_workers parameter: as Darek said, it's an arbitrary value and should be chosen for your specific configuration. With your config, if you expect all models to be used at full capacity (processing 8 requests each) most of the time, then setting grpc_workers to 48 seems like a good idea. If you will, for example, only use one model at a time, then grpc_workers=8 might be a good fit. grpc_workers is set from the CLI for the whole OVMS instance.
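For illustration only, here is a sketch of how the flag could be passed at startup; the image name and entrypoint are copied from the docker command posted further down in this thread, with only grpc_workers changed:

```bash
# Hypothetical invocation: same entrypoint as the test command later in this thread,
# with grpc_workers raised to cover 6 models x 8 parallel requests each.
sudo docker run --rm -d --net=host --name openvino-server test:fdpd \
  /ie-serving-py/start_server.sh ie_serving config \
  --config_path /opt/ml/config.json --port 9001 --grpc_workers 48
```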

For the performance with batches, could you share the script you were using and how grpc_workers was set?

Steedance commented 4 years ago

Thanks for the replies.

I will try the suggestions when I get some time. Could one of you please give me a brief explanation of what a stream actually is? I should also mention that my use case is one client processing images as fast as possible, so it will be requesting inference constantly. In this case, does your (@dtrawins) comment about streams being no higher than the number of expected parallel clients still stand?

@mzegla,

This is the docker run command I use to launch the model server: `sudo docker run --rm -d --net=host --env LOG_LEVEL=DEBUG --name openvino-server test:fdpd /ie-serving-py/start_server.sh ie_serving config --config_path /opt/ml/config.json --port 9001 --grpc_workers 16`. Note that this is a test server and script, so it is slightly different to my main use case. This test server only has 1 model (person detection), and I have tried different values for grpc_workers with seemingly no difference in performance.

Below is the script I used, where I noticed performance was best with a batch size of 3-5:


```python
import cv2
import grpc  # needed for grpc.insecure_channel(); this import was missing originally
import numpy as np
import tensorflow.contrib.util as tf_contrib_util
import time
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
from tensorflow_serving.apis import get_model_metadata_pb2
import threading
import random

def getInferenceData(input_data, input_name, model_outputs, modelName):
    # Build a TF Serving PredictRequest and send it to OVMS over gRPC (30 s timeout).
    # model_outputs is unused here; the output name is read directly in detectPeople().
    request = predict_pb2.PredictRequest()
    request.model_spec.name = modelName
    request.inputs[input_name].CopyFrom(tf_contrib_util.make_tensor_proto(input_data, shape=input_data.shape))

    response = stub.Predict(request, 30)
    return response

def imagePreparation(images, width, height):
    # Resize each image and convert it from HWC to CHW layout, then stack into one batch.
    preparedData = []
    for eachImage in images:
        c, h, w = 3, height, width
        in_frame = cv2.resize(eachImage, (w, h))
        in_frame = in_frame.transpose((2, 0, 1))
        in_frame = in_frame.reshape((c, h, w))
        preparedData.append(in_frame)
    return np.stack(preparedData)

def detectPeople(frame, num):
    print(num, " is starting")
    response = getInferenceData(frame, 'data', 'detection_out', 'person-detection-retail-0013')
    newResponse = tf.make_ndarray(response.outputs["detection_out"])
    print(num, " is done")

grpc_port = '9001'
grpc_address = '0.0.0.0'
max_message_length = 100 * 1024 * 1024
options = [('grpc.max_receive_message_length', max_message_length)]
channel = grpc.insecure_channel("{}:{}".format(grpc_address, grpc_port), options=options)
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Black dummy frame matching the person-detection-retail-0013 input resolution.
deadFrame = np.zeros((320, 544, 3), np.uint8)
deadFrame[:] = (0, 0, 0)
img = deadFrame
maxImages = 500
batchSize = 4

personJobs = []
personInputData = []

# Split the workload into batches and prepare one thread per batch.
arrayOfBatchSizes = []
imagesLeft = maxImages
i = 0
while imagesLeft > 0:
    personBatch = []
    #batchSize = random.randint(1, 10)
    if batchSize > imagesLeft:
        batchSize = imagesLeft
    imagesLeft -= batchSize
    for j in range(0, batchSize):
        personBatch.append(img)
    personInputData.append(imagePreparation(personBatch, 544, 320))
    personJobs.append(threading.Thread(target=detectPeople, args=(personInputData[i], i + 1)))

    arrayOfBatchSizes.append(len(personInputData[i]))
    i += 1

print(arrayOfBatchSizes, "-", len(arrayOfBatchSizes), "batches")
start = int(time.time() * 1000)
print("starting async")

# Send all batches in parallel threads and wait for them to complete.
for j in personJobs:
    j.start()
for j in personJobs:
    j.join()

duration = int(time.time() * 1000) - start
print(duration, " ms")
frames = maxImages
print(frames / (duration / 1000), "fps")
```

I appreciate the support.

mzegla commented 4 years ago

OK, so from my understanding CPU_THROUGHPUT_STREAMS specifies the number of streams, i.e. groups of cores that separately handle inference requests. So if you have 16 cores and 2 streams, you're able to perform 2 inferences in parallel (the first 8 cores handle the first request, the next 8 cores handle the second request). Normally, if you don't set CPU_THROUGHPUT_STREAMS, you get one stream and all cores handle a single request (so requests are handled in sequence, one by one).

When you have a lot of cores, using all of them to handle the same request is not very effective (unless you only care about latency), so to achieve better throughput with such CPUs you can split the cores into groups (streams) and have each group handle its own request. This way you can effectively run multiple inferences at the same time. If you decide to use multiple streams, make sure nireq is greater than or equal to the number of streams.

For your use case I encourage you to try the default settings (1 stream), because it looks to me like you'll have trouble providing enough data to the model to make the multiple-streams approach effective. From what I've observed, when a model does not get enough data, using multiple streams performs worse than using a single one.
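For illustration, here is a hedged sketch of a single-model entry that follows that rule (2 streams on this CPU, with nireq at least equal to the number of streams; the values are examples, not a recommendation):

```json
{
   "config":{
      "name":"person-detection-retail-0013",
      "base_path":"/opt/ml/person-detection-retail-0013",
      "nireq": 4,
      "plugin_config": {"CPU_THROUGHPUT_STREAMS": 2}
   }
}
```

With 16 physical cores and 2 streams, each stream would get roughly 8 cores, and nireq of 4 keeps enough requests queued to occupy both streams.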

This drop in performance for batch sizes bigger than 4 might be caused by using 8 CPU streams. Please try with the default plugin config (1 stream) and let me know about the findings 😃

Steedance commented 4 years ago

Thanks mzegla. Apologies, a few more questions based on what you have said here, but it's very useful for my understanding. From the way you describe it, it sounds like a stream almost reserves CPU threads for its own use. If that is the case, then with the multiple-model configuration from my initial comment, how would it behave if we were making requests to different models in parallel? For example, sending 8 requests to person detection and face detection in parallel: would they fight for CPU threads? Would it be better to reserve a subset of the total threads for each model,

e.g.

as opposed to always specifying the max threads = 32?
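A hypothetical per-model split along those lines (the values below are illustrative only, not taken from the original post) might give each of the two models a quarter of the 32 threads and its own streams:

```json
{
   "config":{
      "name":"person-detection-retail-0013",
      "base_path":"/opt/ml/person-detection-retail-0013",
      "nireq": 8,
      "plugin_config": {"CPU_THROUGHPUT_STREAMS": 4, "CPU_THREADS_NUM": 8}
   }
},
{
   "config":{
      "name":"face-detection-retail-0004",
      "base_path":"/opt/ml/face-detection-retail-0004",
      "nireq": 8,
      "plugin_config": {"CPU_THROUGHPUT_STREAMS": 4, "CPU_THREADS_NUM": 8}
   }
}
```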

Also, why do you think I will struggle to get enough data to my server? The code above is just a small test piece, but we are aiming for maximum throughput against the model server here, so if there's anything I might be missing I would love to know. We are processing many, many more images than in the test script above.

With regard to your final comment, I ran a small number of tests with the benchmark tool using the command: `benchmark_app.py -m /opt/intel/openvino/deployment_tools/open_model_zoo/tools/downloader/intel/person-detection-retail-0013/FP32/person-detection-retail-0013.xml -nstreams 1 -t 30`

[screenshot: benchmark_app results]

mzegla commented 4 years ago

I'm not sure, but I believe it's not exactly the case that a certain model reserves some number of cores; from my experience it looks more like they're assigned dynamically. I haven't tested OVMS that way (with multiple models using multiple streams), but I think that your example:

would result in each model using at most 25% of all cores, which would mean you'd have 4 streams for those two models, not 8 (4 per model). I think this plugin config takes all cores into account for each model. That's my understanding, but as I mentioned, I have not tested it myself, so I might be wrong.

About providing enough data to the server: it all depends on the model, the input size, etc. OVMS does not always reproduce OpenVINO's performance, especially for high-throughput use cases. The model server itself can be incapable of handling a certain volume of requests; what I mean is that OpenVINO might be able to perform more inferences than OVMS can process at a time. You don't see that problem when using the benchmark app (OpenVINO directly and locally). And once the model server starts to be the bottleneck, using multiple streams might not be that effective. Could you compare the results by running the tests against OVMS rather than OpenVINO directly and see if it is still better with multiple streams? Because maybe for your model it's okay.

Well, it seems that even the benchmark app reports worse performance for bigger batch sizes, so it might just be natural for that model.
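If it helps to confirm that, here is a hedged sketch of varying the batch size directly in benchmark_app (the -b flag sets the batch size; the model path is shortened here for brevity):

```bash
# Sketch: compare batch sizes with a single CPU stream, 30 seconds per run.
for b in 1 4 10 50; do
  benchmark_app.py -m person-detection-retail-0013.xml -nstreams 1 -b $b -t 30
done
```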

Steedance commented 4 years ago

Hi,
Apologies for the late response. I managed to get increased performance by using multiple clients, so thanks for all of the help and information.