triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

How to send byte or string array data in perf_analyzer #7526

Open · Kanupriyagoyal opened this issue 3 months ago

Kanupriyagoyal commented 3 months ago

Triton Inference Server: r24.07, model_analyzer: 1.42.0. The model's config.pbtxt:

backend: "python"
max_batch_size: 32 
input [
  {
    name: "IN0"
    data_type: TYPE_STRING
    dims: [ 16 ]
  }
]
output [
  {
    name: "OUT0"
    data_type: TYPE_FP64
    dims: [ 1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

dynamic_batching {
  max_queue_delay_microseconds: 2500
}
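
For context, here is a minimal sketch of a model.py that would match this config (the scoring logic is a placeholder, not the actual model). With max_batch_size set, IN0 arrives per request as a [batch, 16] numpy array of bytes objects:

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # TYPE_STRING inputs arrive as numpy arrays of bytes objects
            in0 = pb_utils.get_input_tensor_by_name(request, "IN0").as_numpy()
            # Placeholder: one FP64 score per row in the batch
            out0 = np.zeros((in0.shape[0], 1), dtype=np.float64)
            responses.append(pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("OUT0", out0)]))
        return responses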

Tried with a direct inference request via curl:

curl -v -X POST http://x.xx.xx.xx:80xx/v2/models/model_name/infer -H "Content-Type: application/json" -d '{
>     "inputs": [
>         {
>             "name": "IN0",
>             "shape": [1, 16],
>             "datatype": "BYTES",
>             "data": [
>                 ["0", "0", "2002", "9", "9", "9", "40", "19", "65.5", "Swipe Transaction", "-3345936507911876459", "La Verne", "CA", "91750", "7538", "Technical Glitch"]
>             ]
>         }
>     ],
>     "outputs": [
>         {
>             "name": "OUT0"
>         }
>     ]
> }'

< HTTP/1.1 200 OK
< Content-Type: application/json
< Content-Length: 117
< 
{"model_name":"model_name","model_version":"1","outputs":[{"name":"OUT0","datatype":"FP64","shape":[1],"data":[0.0]}]}

But when passing it to perf_analyzer via --input-data input.json, where the JSON looks like:

{
    "data": [
        {
            "IN0": {
                "content": [
                    ["17", "2", "2007", "6", "30", "16", "15", "0", "5.4", "Swipe Transaction", "-6571010470072147219", "Bloomville", "OH", "44818",  "5499", "Bad PIN"]
                ],
                "shape": [1,16],
                "datatype": "BYTES"
            }
        }
    ]
}

Getting one of the following errors:

Thread [0] had error: [request id: ] expected 16 string elements for inference input 'IN0', got 1

Failed to init manager inputs: unable to find string data in json

How should string data be passed?

nv-hwoo commented 3 months ago

Hi @Kanupriyagoyal, try this:

{
    "data": [
        {
            "IN0": {
                "content": ["17", "2", "2007", "6", "30", "16", "15", "0", "5.4", "Swipe Transaction", "-6571010470072147219", "Bloomville", "OH", "44818",  "5499", "Bad PIN"],
                "shape": [16]
            }
        }
    ]
}
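
If the file is written by hand, the nesting is easy to get wrong, so here is a small helper sketch (standard library only) that emits the flat-content layout above:

import json

row = ["17", "2", "2007", "6", "30", "16", "15", "0", "5.4",
       "Swipe Transaction", "-6571010470072147219", "Bloomville",
       "OH", "44818", "5499", "Bad PIN"]

doc = {"data": [{"IN0": {"content": row, "shape": [16]}}]}

with open("input_suggested.json", "w") as f:
    json.dump(doc, f, indent=4)
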
Kanupriyagoyal commented 3 months ago

@nv-hwoo I tried your suggestion and still see the same error:

I0822 04:24:42.997591 111595 infer_handler.cc:975] "[request id: <id_unknown>] Infer failed: [request id: <id_unknown>] expected 16 string elements for inference input 'IN0', got 1"
I0822 04:24:42.997662 111595 infer_handler.h:1311] "Received notification for ModelInferHandler, 0"
I0822 04:24:42.997667 111595 infer_handler.cc:728] "Process for ModelInferHandler, rpc_ok=1, 0 step COMPLETE"
I0822 04:24:42.997685 111595 infer_handler.cc:728] "Process for ModelInferHandler, rpc_ok=1, 0 step FINISH"

input_suggested.json

{
    "data": [
        {
            "IN0": {
                "content": ["17", "2", "2007", "6", "30", "16", "15", "0", "5.4", "Swipe Transaction", "-6571010470072147219", "Bloomville", "OH", "44818",  "5499", "Bad PIN"],
                "shape": [16]
            }
        }
    ]
}
perf_analyzer -m xgb_model --service-kind=triton --model-repository=/models -b 1 -u localhost:8001 -i grpc -f xgb_model.csv  --verbose-csv --concurrency-range 1 --measurement-mode count_windows  --input-tensor-format json --input-data input_suggested.json --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000
 Successfully read data for 1 stream/streams with 1 step/steps.
*** Measurement Settings ***
  Batch size: 1
  Service Kind: TRITON
  Using "count_windows" mode for stabilization
  Stabilizing using average latency and throughput
  Minimum number of samples in each window: 50
  Using synchronous calls for inference

Request concurrency: 1
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [0] had error: [request id: <id_unknown>] expected 16 string elements for inference input 'IN0', got 1
Hemaprasannakc commented 2 months ago

@nv-hwoo @Kanupriyagoyal

After some analysis, I identified that when we send JSON input through HTTP to perf_analyzer, it interprets the input tensors as binary by default. The http_server.cc file in Triton contains specific logic that handles binary and byte data separately.

To resolve this, explicitly specify that the input format is JSON by using the following option:

--input-tensor-format json

This worked for me with HTTP JSON input, and the element-count issue was resolved.

(Also make sure the endianness of the bytes is handled correctly.)
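
On the endianness point: Triton's binary representation of a BYTES tensor prefixes each element with a 4-byte little-endian length (the Python client does this via tritonclient.utils.serialize_byte_tensor). A minimal sketch of that serialization:

import struct

def serialize_bytes_tensor(strings):
    # Each BYTES element: 4-byte little-endian length prefix + raw bytes
    buf = b""
    for s in strings:
        raw = s.encode("utf-8")
        buf += struct.pack("<I", len(raw)) + raw
    return buf

# "Bad PIN" serializes with the length prefix 07 00 00 00 (little-endian 7)
print(serialize_bytes_tensor(["Bad PIN"]).hex())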