openvinotoolkit / model_server

A scalable inference server for models optimized with OpenVINO™
https://docs.openvino.ai/2024/ovms_what_is_openvino_model_server.html
Apache License 2.0

high latency #2721

Open sriram-dsl opened 1 month ago

sriram-dsl commented 1 month ago

Latency in OpenVINO Model Server Inside Kubernetes Cluster

To Reproduce

Steps to reproduce the behavior:

  1. Prepare Models Repository: Followed standard procedures to set up the models repository.
  2. OVMS Launch Command:
    
    docker run -d --rm \
     -v /home/raa/model:/models \
     -p 900:9000 \
     -p 800:8000 \
     openvino/model_server:latest \
     --model_path /models/model1/1 \
     --model_name model1 \
     --port 9000 \
     --rest_port 8000 \
     --log_level DEBUG
  3. Client command (see the timing sketch below):
       from ovmsclient import make_grpc_client
       client = make_grpc_client("192.168.1.10:900")
       pred = client.predict(inputs=inputs, model_name="model1")
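
For reference, round-trip latency was measured with a small script along these lines (a rough sketch; the input name "images" and the 1x3x640x640 float32 shape are placeholders and should be adjusted to the actual model inputs):

    # Hypothetical timing sketch, not the exact client used in this report.
    import time

    import numpy as np
    from ovmsclient import make_grpc_client

    client = make_grpc_client("192.168.1.10:900")
    # Input name and shape are assumptions; check the model metadata for the real ones.
    inputs = {"images": np.random.rand(1, 3, 640, 640).astype(np.float32)}

    for _ in range(5):
        start = time.perf_counter()
        client.predict(inputs=inputs, model_name="model1")
        print(f"round-trip latency: {time.perf_counter() - start:.3f} s")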

Environment

I am using OpenVINO Model Server for model deployment on edge devices with a 13th Gen Intel Core i7 processor. My PyTorch model was trained with an input image size of 640.

Expected Behavior

I expected the inference time for the OpenVINO Model Server inside the Kubernetes cluster to be comparable to that observed outside the cluster (i.e., <0.01 seconds).

Model directory format

model/
├── model1/
│   └── 1/
│       ├── model.bin
│       ├── model.mapping
│       └── model.xml
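
As a sanity check, the served model's metadata (input and output names, shapes, precisions) can also be queried over gRPC, for example with a sketch like this (assuming the ovmsclient package and the address used above):

    from ovmsclient import make_grpc_client

    client = make_grpc_client("192.168.1.10:900")
    # Returns the model version status plus input/output names, shapes and precisions.
    print(client.get_model_metadata(model_name="model1"))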
mzegla commented 1 month ago

Could you share the Kubernetes deployment configuration? I can see you are running with DEBUG log level - can you share logs from the server in your deployment with a few predictions logged?

sriram-dsl commented 1 month ago

Logs:

[2024-10-08 09:51:34.769][1][modelmanager][debug][modelinstance.cpp:737] Model: model1; version: 1; target device: CPU, CompiledModel config key: OPTIMAL_NUMBER_OF_
[2024-10-08 09:51:34.769][1][modelmanager][debug][modelinstance.cpp:737] Model: model1; version: 1; target device: CPU, CompiledModel config key: NUM_STREAMS, value
[2024-10-08 09:51:34.769][1][modelmanager][debug][modelinstance.cpp:737] Model: model1; version: 1; target device: CPU, CompiledModel config key: AFFINITY, value: C
[2024-10-08 09:51:34.769][1][modelmanager][debug][modelinstance.cpp:737] Model: model1; version: 1; target device: CPU, CompiledModel config key: INFERENCE_NUM_THRE
[2024-10-08 09:51:34.769][1][modelmanager][debug][modelinstance.cpp:737] Model: model1; version: 1; target device: CPU, CompiledModel config key: PERF_COUNT, value:
[2024-10-08 09:51:34.769][1][modelmanager][debug][modelinstance.cpp:737] Model: model1; version: 1; target device: CPU, CompiledModel config key: INFERENCE_PRECISIO
[2024-10-08 09:51:34.769][1][modelmanager][debug][modelinstance.cpp:737] Model: model1; version: 1; target device: CPU, CompiledModel config key: PERFORMANCE_HINT,
[2024-10-08 09:51:34.769][1][modelmanager][debug][modelinstance.cpp:737] Model: model1; version: 1; target device: CPU, CompiledModel config key: EXECUTION_MODE_HIN
[2024-10-08 09:51:34.769][1][modelmanager][debug][modelinstance.cpp:737] Model: model1; version: 1; target device: CPU, CompiledModel config key: PERFORMANCE_HINT_N
[2024-10-08 09:51:34.769][1][modelmanager][debug][modelinstance.cpp:737] Model: model1; version: 1; target device: CPU, CompiledModel config key: ENABLE_CPU_PINNING
[2024-10-08 09:51:34.769][1][modelmanager][debug][modelinstance.cpp:737] Model: model1; version: 1; target device: CPU, CompiledModel config key: SCHEDULING_CORE_TY
[2024-10-08 09:51:34.769][1][modelmanager][debug][modelinstance.cpp:737] Model: model1; version: 1; target device: CPU, CompiledModel config key: ENABLE_HYPER_THREA
[2024-10-08 09:51:34.769][1][modelmanager][debug][modelinstance.cpp:737] Model: model1; version: 1; target device: CPU, CompiledModel config key: EXECUTION_DEVICES,
[2024-10-08 09:51:34.771][1][serving][info][modelinstance.cpp:793] Loaded model model1; version: 1; batch size: 1; No of InferRequests: 1
[2024-10-08 09:51:34.771][1][modelmanager][debug][modelinstance.cpp:870] Is model loaded from cache: false
[2024-10-08 09:51:34.771][1][serving][debug][modelversionstatus.cpp:88] setAvailable: model1 - 1 (previous state: LOADING) -> error: OK
[2024-10-08 09:51:34.771][1][serving][info][modelversionstatus.cpp:109] STATUS CHANGE: Version 1 of model model1 status change. New status: ( "state": "AVAILABLE",
[2024-10-08 09:51:34.771][1][serving][info][model.cpp:88] Updating default version for model: model1, from: 0
[2024-10-08 09:51:34.771][1][serving][info][model.cpp:98] Updated default version for model: model1, to: 1
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:664] Specified model parameters:
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:665] model_basepath: /models/model2
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:666] model_name: model2
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:667] batch_size: not configured
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:671] shape:
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:677] model_version_policy: latest: 1
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:679] nireq: 0
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:680] target_device: CPU
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:681] plugin_config:
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:683]   PERFORMANCE_HINT: LATENCY
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:689] Batch size set: false, shape set: false
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:696] stateful: false
[2024-10-08 09:51:34.771][1][modelmanager][debug][ov_utils.cpp:90] Validating plugin: CPU; configuration
[2024-10-08 09:51:34.771][1][modelmanager][error][modelmanager.cpp:1340] Directory does not exist: /models/model2
[2024-10-08 09:51:34.771][1][modelmanager][debug][modelmanager.cpp:737] Cannot reload model: model2 with versions due to error: The provided base path is invalid or
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:664] Specified model parameters:
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:665] model_basepath: /models/model5
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:666] model_name: model5
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:667] batch_size: not configured
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:671] shape:
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:677] model_version_policy: latest: 1
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:679] nireq: 0
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:680] target_device: CPU
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:681] plugin_config:
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:683]   PERFORMANCE_HINT: LATENCY
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:689] Batch size set: false, shape set: false
[2024-10-08 09:51:34.771][1][serving][debug][modelconfig.cpp:696] stateful: false
[2024-10-08 09:51:34.771][1][modelmanager][debug][ov_utils.cpp:90] Validating plugin: CPU; configuration
[2024-10-08 09:51:34.771][1][modelmanager][error][modelmanager.cpp:1340] Directory does not exist: /models/model5
[2024-10-08 09:51:34.771][1][modelmanager][debug][modelmanager.cpp:737] Cannot reload model: model5 with versions due to error: The provided base path is invalid or
[2024-10-08 09:51:34.771][1][modelmanager][error][modelmanager.cpp:772] Loading main OVMS config models failed.
[2024-10-08 09:51:34.771][1][modelmanager][info][modelmanager.cpp:542] Configuration file doesn't have custom node libraries property.
[2024-10-08 09:51:34.771][1][modelmanager][info][modelmanager.cpp:585] Configuration file doesn't have pipelines property.
[2024-10-08 09:51:34.771][1][serving][info][servablemanagermodule.cpp:44] ServableManagerModule started
[2024-10-08 09:51:34.771][72][modelmanager][info][modelmanager.cpp:1103] Started cleaner thread
[2024-10-08 09:51:34.771][71][modelmanager][info][modelmanager.cpp:1085] Started model manager thread
[2024-10-08 09:51:34.928][63][serving][debug][kfs_grpc_inference_service.cpp:206] GetModelMetadata requested model: name model1; version 1
[2024-10-08 09:51:34.928][63][serving][debug][modelinstance.cpp:1019] Model: model1, version: 1 already loaded
[2024-10-08 09:51:34.941][63][serving][debug][kfs_grpc_inference_service.cpp:238] Processing gRPC request for model: model1; version: 1
[2024-10-08 09:51:34.941][63][serving][debug][modelmanager.cpp:1550] Requesting model: model1; version: 1.
[2024-10-08 09:51:34.941][63][serving][debug][modelinstance.cpp:1019] Model: model1, version: 1 already loaded
[2024-10-08 09:51:34.941][63][serving][debug][modelinstance.cpp:1197] Getting infer req duration in model model1, version 1, nireq 0: 0.005 ms
[2024-10-08 09:51:34.941][63][serving][debug][modelinstance.cpp:1205] Preprocessing duration in model model1, version 1, nireq 0: 0.000 ms
[2024-10-08 09:51:34.941][63][serving][debug][modelinstance.cpp:1215] Deserialization duration in model model1, version 1, nireq 0: 0.052 ms
[2024-10-08 09:51:35.771][71][modelmanager][error][modelmanager.cpp:1340] Directory does not exist: /models/model5
[2024-10-08 09:51:35.771][71][modelmanager][error][modelmanager.cpp:1340] Directory does not exist: /models/model2
[2024-10-08 09:51:35.828][64][serving][debug][kfs_grpc_inference_service.cpp:206] GetModelMetadata requested model: name model1; version 1
[2024-10-08 09:51:35.828][64][serving][debug][modelinstance.cpp:1019] Model: model1, version: 1 already loaded
[2024-10-08 09:51:35.843][64][serving][debug][kfs_grpc_inference_service.cpp:238] Processing gRPC request for model: model1; version: 1
[2024-10-08 09:51:35.843][64][serving][debug][modelmanager.cpp:1550] Requesting model: model1; version: 1.
[2024-10-08 09:51:35.843][64][serving][debug][modelinstance.cpp:1019] Model: model1, version: 1 already loaded
[2024-10-08 09:51:36.368][63][serving][debug][modelinstance.cpp:1223] Prediction duration in model model1, version 1, nireq 0: 1426.586 ms
[2024-10-08 09:51:36.368][63][serving][debug][modelinstance.cpp:1232] Serialization duration in model model1, version 1, nireq 0: 0.209 ms
[2024-10-08 09:51:36.368][63][serving][debug][modelinstance.cpp:1240] Postprocessing duration in model model1, version 1, nireq 0: 0.000 ms
[2024-10-08 09:51:36.371][64][serving][debug][modelinstance.cpp:1197] Getting infer req duration in model model1, version 1, nireq 0: 528.159 ms
[2024-10-08 09:51:36.371][64][serving][debug][modelinstance.cpp:1205] Preprocessing duration in model model1, version 1, nireq 0: 0.000 ms
[2024-10-08 09:51:36.371][64][serving][debug][modelinstance.cpp:1215] Deserialization duration in model model1, version 1, nireq 0: 0.054 ms
[2024-10-08 09:51:36.371][63][serving][debug][kfs_grpc_inference_service.cpp:258] Total gRPC request processing time: 1430.05 ms
[2024-10-08 09:51:36.383][75][serving][debug][kfs_grpc_inference_service.cpp:206] GetModelMetadata requested model: name model1; version 1
[2024-10-08 09:51:36.383][75][serving][debug][modelinstance.cpp:1019] Model: model1, version: 1 already loaded
[2024-10-08 09:51:36.395][75][serving][debug][kfs_grpc_inference_service.cpp:238] Processing gRPC request for model: model1; version: 1
[2024-10-08 09:51:36.395][75][serving][debug][modelmanager.cpp:1550] Requesting model: model1; version: 1.
[2024-10-08 09:51:36.395][75][serving][debug][modelinstance.cpp:1019] Model: model1, version: 1 already loaded
[2024-10-08 09:51:36.771][71][modelmanager][error][modelmanager.cpp:1340] Directory does not exist: /models/model5
[2024-10-08 09:51:36.772][71][modelmanager][error][modelmanager.cpp:1340] Directory does not exist: /models/model2
[2024-10-08 09:51:37.772][71][modelmanager][error][modelmanager.cpp:1340] Directory does not exist: /models/model5
[2024-10-08 09:51:37.772][71][modelmanager][error][modelmanager.cpp:1340] Directory does not exist: /models/model2
[2024-10-08 09:51:37.775][64][serving][debug][modelinstance.cpp:1223] Prediction duration in model model1, version 1, nireq 0: 1404.487 ms
[2024-10-08 09:51:37.776][64][serving][debug][modelinstance.cpp:1232] Serialization duration in model model1, version 1, nireq 0: 0.187 ms
[2024-10-08 09:51:37.776][64][serving][debug][modelinstance.cpp:1240] Postprocessing duration in model model1, version 1, nireq 0: 0.000 ms
[2024-10-08 09:51:37.776][64][serving][debug][kfs_grpc_inference_service.cpp:258] Total gRPC request processing time: 1933.02 ms
[2024-10-08 09:51:37.776][75][serving][debug][modelinstance.cpp:1197] Getting infer req duration in model model1, version 1, nireq 0: 1381.122 ms
[2024-10-08 09:51:37.776][75][serving][debug][modelinstance.cpp:1205] Preprocessing duration in model model1, version 1, nireq 0: 0.000 ms
[2024-10-08 09:51:37.776][75][serving][debug][modelinstance.cpp:1215] Deserialization duration in model model1, version 1, nireq 0: 0.039 ms
[2024-10-08 09:51:37.793][64][serving][debug][kfs_grpc_inference_service.cpp:206] GetModelMetadata requested model: name model1; version 1
[2024-10-08 09:51:37.793][64][serving][debug][modelinstance.cpp:1019] Model: model1, version: 1 already loaded
[2024-10-08 09:51:37.802][64][serving][debug][kfs_grpc_inference_service.cpp:238] Processing gRPC request for model: model1; version: 1
[2024-10-08 09:51:37.802][64][serving][debug][modelmanager.cpp:1550] Requesting model: model1; version: 1.
[2024-10-08 09:51:37.802][64][serving][debug][modelinstance.cpp:1019] Model: model1, version: 1 already loaded
[2024-10-08 09:51:38.772][71][modelmanager][error][modelmanager.cpp:1340] Directory does not exist: /models/model5
[2024-10-08 09:51:38.772][71][modelmanager][error][modelmanager.cpp:1340] Directory does not exist: /models/model2
[2024-10-08 09:51:39.181][75][serving][debug][modelinstance.cpp:1223] Prediction duration in model model1, version 1, nireq 0: 1404.577 ms
[2024-10-08 09:51:39.181][75][serving][debug][modelinstance.cpp:1232] Serialization duration in model model1, version 1, nireq 0: 0.243 ms
[2024-10-08 09:51:39.181][75][serving][debug][modelinstance.cpp:1240] Postprocessing duration in model model1, version 1, nireq 0: 0.000 ms
[2024-10-08 09:51:39.181][75][serving][debug][kfs_grpc_inference_service.cpp:258] Total gRPC request processing time: 2786.09 ms
[2024-10-08 09:51:39.182][64][serving][debug][modelinstance.cpp:1197] Getting infer req duration in model model1, version 1, nireq 0: 1379.621 ms
[2024-10-08 09:51:39.182][64][serving][debug][modelinstance.cpp:1205] Preprocessing duration in model model1, version 1, nireq 0: 0.000 ms
[2024-10-08 09:51:39.182][64][serving][debug][modelinstance.cpp:1215] Deserialization duration in model model1, version 1, nireq 0: 0.050 ms
[2024-10-08 09:51:39.196][63][serving][debug][kfs_grpc_inference_service.cpp:206] GetModelMetadata requested model: name model1; version 1
[2024-10-08 09:51:39.196][63][serving][debug][modelinstance.cpp:1019] Model: model1, version: 1 already loaded
[2024-10-08 09:51:39.209][73][serving][debug][kfs_grpc_inference_service.cpp:238] Processing gRPC request for model: model1; version: 1
[2024-10-08 09:51:39.209][73][serving][debug][modelmanager.cpp:1550] Requesting model: model1; version: 1.
[2024-10-08 09:51:39.209][73][serving][debug][modelinstance.cpp:1019] Model: model1, version: 1 already loaded
[2024-10-08 09:51:39.772][71][modelmanager][error][modelmanager.cpp:1340] Directory does not exist: /models/model5
[2024-10-08 09:51:39.772][71][modelmanager][error][modelmanager.cpp:1340] Directory does not exist: /models/model2
[2024-10-08 09:51:40.585][64][serving][debug][modelinstance.cpp:1223] Prediction duration in model model1, version 1, nireq 0: 1403.511 ms
[2024-10-08 09:51:40.585][64][serving][debug][modelinstance.cpp:1232] Serialization duration in model model1, version 1, nireq 0: 0.186 ms
[2024-10-08 09:51:40.585][64][serving][debug][modelinstance.cpp:1240] Postprocessing duration in model model1, version 1, nireq 0: 0.000 ms
[2024-10-08 09:51:40.586][64][serving][debug][kfs_grpc_inference_service.cpp:258] Total gRPC request processing time: 2783.48 ms
[2024-10-08 09:51:40.586][73][serving][debug][modelinstance.cpp:1197] Getting infer req duration in model model1, version 1, nireq 0: 1376.769 ms
[2024-10-08 09:51:40.586][73][serving][debug][modelinstance.cpp:1205] Preprocessing duration in model model1, version 1, nireq 0: 0.000 ms
[2024-10-08 09:51:40.586][73][serving][debug][modelinstance.cpp:1215] Deserialization duration in model model1, version 1, nireq 0: 0.056 ms
[2024-10-08 09:51:40.602][63][serving][debug][kfs_grpc_inference_service.cpp:206] GetModelMetadata requested model: name model1; version 1
[2024-10-08 09:51:40.602][63][serving][debug][modelinstance.cpp:1019] Model: model1, version: 1 already loaded
[2024-10-08 09:51:40.612][63][serving][debug][kfs_grpc_inference_service.cpp:238] Processing gRPC request for model: model1; version: 1
[2024-10-08 09:51:40.612][63][serving][debug][modelmanager.cpp:1550] Requesting model: model1; version: 1.
[2024-10-08 09:51:40.612][63][serving][debug][modelinstance.cpp:1019] Model: model1, version: 1 already loaded
[2024-10-08 09:51:40.772][71][modelmanager][error][modelmanager.cpp:1340] Directory does not exist: /models/model5
[2024-10-08 09:51:40.772][71][modelmanager][error][modelmanager.cpp:1340] Directory does not exist: /models/model2
mzegla commented 1 month ago

Okay, from the logs it looks like a single prediction takes ~1.4 s. Are you sure there are no resource constraints on your deployment? If the standalone server and the Kubernetes deployment run on the same machine, it looks like the server running in the Kubernetes cluster is using only a fraction of the resources available to the standalone instance.

sriram-dsl commented 1 month ago

I have been using this OpenVINO server for the last two years without any issues. However, we started encountering this problem in the most recent deployments, even though we are using the same resources as in previous deployments.

This is the resources section of the deployment:

  # Resource Configuration
  metaResources:
      resources:
        limits:
          cpu: "1"                     # Low resource limits for CPU
          memory: "2G"                 # Low resource limits for memory
        requests:
          cpu: "1"                     # Low resource requests for CPU
          memory: "2G"                 # Low resource requests for memory

  serviceName: inference-service1     # Name of the inference service
  serviceType: ClusterIP              # Service type for the inference service
  port: 9000                          # Port for the inference service
  grpcWorkers: 2                      # Number of gRPC workers
  labels:
    app: inference1                   # Labels for identifying the application
mzegla commented 1 month ago

Your standalone server does not have CPU limitations, but your Kubernetes deployment does. Please try removing CPU from limits and requests in the resources field and check if performance gets closer to the standalone instance.
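
To confirm whether a CPU quota is actually in effect inside the pod after the change, one rough option is to inspect the cgroup limits from within the container, e.g. via kubectl exec (a sketch assuming the default cgroup v1/v2 mount paths):

    import os
    from pathlib import Path

    print("visible CPUs:", os.cpu_count())

    v2 = Path("/sys/fs/cgroup/cpu.max")               # cgroup v2: "<quota> <period>" or "max <period>"
    v1 = Path("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")  # cgroup v1: quota in microseconds, -1 means unlimited
    if v2.exists():
        print("cpu.max:", v2.read_text().strip())
    elif v1.exists():
        period = Path("/sys/fs/cgroup/cpu/cpu.cfs_period_us").read_text().strip()
        print("cfs quota/period:", v1.read_text().strip(), "/", period)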

sriram-dsl commented 1 month ago

Sorry for the delayed response. We have tried it, but it didn't work. Now, this issue is occurring in every deployment. We are running inference containers outside the cluster.

Could you help us understand what might be causing this? It seems like there could be issues in other parts of the platform as well, right? However, from the logs, it is clear that the server itself is taking more than 2 seconds for each inference.

mzegla commented 1 month ago

Now, this issue is occurring in every deployment. We are running inference containers outside the cluster.

So you observe the same issue outside Kubernetes?

Could you help us understand what might be causing this? It seems like there could be issues in other parts of the platform as well, right? However, from the logs, it's clear that the server itself is taking more than 2 seconds for each inference

From the logs I would say that only a fraction of the CPU is used, as if there were some limitation set. I don't know the details of your platform, but you can try to run the server in isolation and send requests manually to make sure it behaves the same.

sriram-dsl commented 1 month ago

Outside of Kubernetes, it is working fine. The previous deployments done on Kubernetes are facing the issue. I will try removing all limitations inside the cluster.