triton-inference-server / onnxruntime_backend

The Triton backend for the ONNX Runtime.

Error in onnxruntime-openvino backend when run with Triton #89

Open mayani-nv opened 2 years ago

mayani-nv commented 2 years ago

Description The OnnxRt-Openvino backend produces the errors when ran with Triton. The error shows up when running the BERT onnx model from the zoo. However, when the same model is ran from the Jupyter notebook outside of Triton with OnnxRT-openvino backend it produces the desired outputs.

Triton Information Triton server container v21.10

Are you using the Triton container or did you build it yourself? - using container v21.10

To Reproduce

  1. Download the BERT onnx model from the onnx zoo

  2. The following is the config.pbtxt which uses the Openvino accelerator

    name: "bert_onnx_cpu_i0"
    platform: "onnxruntime_onnx"
    max_batch_size: 16
    input {
    name: "unique_ids_raw_output___9:0"
    data_type: TYPE_INT64
    dims: 1
    reshape {
    }
    }
    input {
    name: "segment_ids:0"
    data_type: TYPE_INT64
    dims: 256
    }
    input {
    name: "input_mask:0"
    data_type: TYPE_INT64
    dims: 256
    }
    input {
    name: "input_ids:0"
    data_type: TYPE_INT64
    dims: 256
    }
    output {
    name: "unstack:1"
    data_type: TYPE_FP32
    dims: 256
    }
    output {
    name: "unstack:0"
    data_type: TYPE_FP32
    dims: 256
    }
    output {
    name: "unique_ids:0"
    data_type: TYPE_INT64
    dims: 1
    reshape {
    }
    }
    instance_group {
    count: 2
    kind: KIND_CPU
    }
    dynamic_batching {
    preferred_batch_size: 2
    max_queue_delay_microseconds: 300
    }
    optimization {
    execution_accelerators {
    cpu_execution_accelerator {
      name: "openvino"
    }
    }
    }
  3. Run perf_analyzer on the Triton-hosted model and get the following error

    
    2021-12-06 20:30:49.669 INFO[perf_analyzer.py:258] Running perf_analyzer ['perf_analyzer', '-m', 'bert_onnx_cpu_i1', '-b', '1', '-u', 'localhost:8001', '-i', 'grpc', '--measurement-interval', '10000', '--concurrency-range', '1', '--measurement-mode', 'time_windows'] failed with exit status 1 : *** Measurement Settings ***
    Batch size: 1
    Using "time_windows" mode for stabilization
    Measurement window: 10000 msec
    Using synchronous calls for inference
    Stabilizing using average latency

Request concurrency: 1 Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests. Thread [0] had error: onnx runtime error 6: Non-zero status code returned while running OpenVINO-EP-subgraph_5 node. Name:'OpenVINOExecutionProvider_OpenVINO-EP-subgraph_5_1' Status Message: Cannot find blob with name: input_ids:0
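
For completeness, the input/output names and shapes that Triton itself exposes for this model can be checked with the gRPC client; a quick sketch (server address and model name taken from the setup above):

    # Query Triton for the metadata and config it is actually serving for this model.
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url="localhost:8001")
    print(client.get_model_metadata("bert_onnx_cpu_i0"))
    print(client.get_model_config("bert_onnx_cpu_i0"))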

tanmayv25 commented 2 years ago

@askhade Do you have any insights into the error? It does look like an issue with the ONNXRT/OpenVINO integration, but the model seems to work with the Python frontend of ONNX-RT with the OpenVINO EP.

askhade commented 2 years ago

From the error message it looks like it is unable to get the input "input_ids:0". Maybe some issue with the input mapping; not sure, it needs investigation. How urgent is this?
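
In the meantime, one quick way to rule out a naming mismatch is to dump the input/output names the ONNX graph actually declares and compare them with config.pbtxt; a minimal sketch (the model path is a placeholder):

    # List the graph-level inputs/outputs of the ONNX file to compare with config.pbtxt.
    import onnx

    model = onnx.load("model.onnx")
    initializers = {init.name for init in model.graph.initializer}

    print("inputs:")
    for inp in model.graph.input:
        if inp.name not in initializers:  # skip weights that are also listed as graph inputs
            dims = [d.dim_value or d.dim_param for d in inp.type.tensor_type.shape.dim]
            print(f"  {inp.name}: {dims}")

    print("outputs:")
    for out in model.graph.output:
        print(f"  {out.name}")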

mayani-nv commented 2 years ago

This experiment was done as part of the Model Analyzer integration with ONNX Runtime's OLIVE tool. The ask was to see how the ORT hyperparameters (backends, precision, etc.) can be swept using MA.

mayani-nv commented 2 years ago

@askhade I tried with the YOLOv2 ONNX model and the OpenVINO backend seems to be working fine. It is only with the BERT ONNX model that this error persists. Also, I tried to run with the ORT CPU-only backend for my BERT ONNX model by commenting out the following lines in my config.pbtxt:

#optimization {
#  execution_accelerators {
#    cpu_execution_accelerator {
#      name: "openvino"
#    }
# }
#}

I get the following error:

docker run  -it --rm --net=host nvcr.io/nvidia/tritonserver:21.06-py3-sdk
root@AMLTritonTester:/workspace# perf_analyzer -m bert_onnx_cpu --concurrency-range 1:4
*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 4 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: onnxruntime execute failure 2: Non-zero status code returned while running Gather node. Name:'bert/embeddings/GatherV2' Status Message: indices element out of data bounds, idx=-1420042007188224409 must be within the inclusive range [-30522,30521]

tanmayv25 commented 2 years ago

@mayani-nv BERT is a data-sensitive model. perf_analyzer by default uses random data to fill in the tensors, and the model might not like that. You should be able to get it working by providing realistic data as JSON to perf_analyzer, or by providing -z like below: perf_analyzer -m bert_onnx_cpu --concurrency-range 1:4 -z
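
If -z is not enough, here is a rough sketch of generating a realistic --input-data file for perf_analyzer (the JSON layout follows my reading of the perf_analyzer docs; the vocab bound comes from the Gather error above, and the file name is arbitrary):

    # Generate a perf_analyzer --input-data JSON file with token ids inside BERT's vocab.
    import json
    import random

    seq_len = 256
    sample = {
        "unique_ids_raw_output___9:0": [0],
        "segment_ids:0": [0] * seq_len,
        "input_mask:0": [1] * seq_len,
        "input_ids:0": [random.randint(0, 30521) for _ in range(seq_len)],
    }

    with open("bert_input_data.json", "w") as f:
        json.dump({"data": [sample]}, f)

    # then: perf_analyzer -m bert_onnx_cpu --input-data bert_input_data.json --concurrency-range 1:4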

mayani-nv commented 2 years ago

@tanmayv25 thank you for the suggestion. For the ORT CPU-only backend, providing the -z option helped and I am getting the following:

/perf_analyzer -m bert_onnx_cpu -z  --concurrency-range 4
*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 4
  Client: 
    Request count: 19
    Throughput: 3.8 infer/sec
    Avg latency: 1011611 usec (standard deviation 215023 usec)
    p50 latency: 1057326 usec
    p90 latency: 1312771 usec
    p95 latency: 1315162 usec
    p99 latency: 1315297 usec
    Avg HTTP time: 993732 usec (send/recv 60 usec + response wait 993672 usec)
  Server: 
    Inference count: 24
    Execution count: 19
    Successful request count: 19
    Avg request latency: 993315 usec (overhead 43 usec + queue 306508 usec + compute input 41 usec + compute infer 686683 usec + compute output 40 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 4, throughput: 3.8 infer/sec, latency 1011611 usec

However, doing the same with the ORT-OpenVINO backend still gives the same error:

 ./perf_analyzer -m bert_onnx_cpu -z  --concurrency-range 4
*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 4
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: onnx runtime error 6: Non-zero status code returned while running OpenVINO-EP-subgraph_5 node. Name:'OpenVINOExecutionProvider_OpenVINO-EP-subgraph_5_1' Status Message: Cannot find blob with name: input_ids:0
Thread [1] had error: onnx runtime error 6: Non-zero status code returned while running OpenVINO-EP-subgraph_2 node. Name:'OpenVINOExecutionProvider_OpenVINO-EP-subgraph_2_1' Status Message: Cannot find blob with name: input_ids:0
Thread [2] had error: onnx runtime error 6: Non-zero status code returned while running OpenVINO-EP-subgraph_2 node. Name:'OpenVINOExecutionProvider_OpenVINO-EP-subgraph_2_1' Status Message: Cannot find blob with name: input_ids:0
Thread [3] had error: onnx runtime error 6: Non-zero status code returned while running OpenVINO-EP-subgraph_5 node. Name:'OpenVINOExecutionProvider_OpenVINO-EP-subgraph_5_1' Status Message: Cannot find blob with name: input_ids:0

tanmayv25 commented 2 years ago

Yes, IMO the OpenVINO error is not because of the tensor data but because of the OpenVINO integration with ONNXRT.
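
For what it's worth, one way to dig into that outside of Triton is to turn on verbose ONNX Runtime logging in a standalone session, so the log shows which nodes end up on the OpenVINO EP, and compare that against the OpenVINO-EP-subgraph names in the Triton errors above; a sketch (the model path is a placeholder):

    # Verbose ORT logging prints execution-provider node placement during session creation.
    import onnxruntime as ort

    so = ort.SessionOptions()
    so.log_severity_level = 0  # 0 = VERBOSE

    sess = ort.InferenceSession(
        "model.onnx",
        sess_options=so,
        providers=["OpenVINOExecutionProvider", "CPUExecutionProvider"],
    )
    print(sess.get_providers())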