Open gavrissh opened 1 year ago
@gavrishp Both envelopes need fixes for KServe 0.10, and the v2 protocol has 2 examples: one with bytes input and another with tensor input. Let me know what you are working on; I can take up the remaining.
@jagadeeshi2i I can take the v2 protocol changes.
There's one use case I need inputs for. KServe supports batching within a single request. Request example:
{
  "inputs": [
    {
      "name": "input-0",
      "shape": [37],
      "datatype": "INT64",
      "data": [66, 108, 111, 111, 109]
    },
    {
      "name": "input-0",
      "shape": [37],
      "datatype": "INT64",
      "data": [66, 108, 111, 111, 109]
    }
  ]
}
Response for the above example is
{
  "model_name": "resnet50",
  "model_version": "3.0",
  "id": "c0229ab0-f157-4917-974a-93646a51a57d",
  "parameters": null,
  "outputs": [
    {
      "name": "predict",
      "shape": [],
      "datatype": "BYTES",
      "parameters": null,
      "data": [2]
    },
    {
      "name": "predict",
      "shape": [],
      "datatype": "BYTES",
      "parameters": null,
      "data": [2]
    }
  ]
}
But with TorchServe batching of multiple requests, the handler's postprocess would return a list of outputs. Might we also need to hold some additional state to keep track of which input came from which request_id?
In the above example a single HTTP request has multiple inputs in it, so the response will have outputs in the same order for that request ID. You are referring to TorchServe dynamic batching, which is not supported in the KServe integration.
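A minimal sketch of that ordering contract, assuming a hypothetical helper (build_v2_response is illustrative, not part of kservev2.py); each entry in outputs lines up positionally with an entry in the request's inputs:

# Illustrative only: a v2 response whose "outputs" list mirrors the request's
# "inputs" list, so order alone ties each output back to its input.
import uuid

def build_v2_response(model_name, inputs, predictions):
    # One prediction is expected per input of the single request.
    assert len(inputs) == len(predictions), "one output expected per input"
    return {
        "model_name": model_name,
        "id": str(uuid.uuid4()),
        "outputs": [
            {"name": "predict", "shape": [], "datatype": "BYTES", "data": [p]}
            for p in predictions
        ],
    }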
This issue concerns TorchServe dynamic batching with the KServe integration. Is there any particular reason for it not being supported? Is it planned to be supported in the future?
If that is the case, the TS model config batch_size should not be allowed to be set to more than 1 right now. I suppose it is causing this particular issue.
My understanding is that this batching will help with better GPU utilization and higher throughput values. My testing results support this.
TorchServe with KServe has batching support: the inputs are statically batched within a single request. TorchServe on its own does dynamic batching, where it waits up to batch_delay for batch_size requests to accumulate.
KServe v2 requires sending all inputs in a single request, so setting batch_size to more than 1 here only makes TorchServe wait out the batch_delay.
Regarding GPU utilization, both static and dynamic batching start processing only after all the inputs are received, so this does not affect GPU utilization.
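For reference, a hedged sketch of how TorchServe's own dynamic-batching knobs are set when registering a model through the management API; the host, port, and values here are assumptions for illustration, not taken from this thread:

# Registers a model with the TorchServe management API and sets the dynamic
# batching parameters discussed above (batch_size, max_batch_delay).
# Adjust host/port and the .mar name to your deployment.
import requests

resp = requests.post(
    "http://0.0.0.0:8081/models",
    params={
        "url": "resnet50.mar",
        "batch_size": 16,        # max requests aggregated into one batch
        "max_batch_delay": 200,  # ms to wait for the batch to fill
        "initial_workers": 1,
    },
)
print(resp.status_code, resp.text)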
Thanks for clarifying!
What would you suggest as the correct fix for this issue?
With batch_size > 1, the TS service throws the error 'message': 'number of batch response mismatched' because it did dynamic batching of multiple inputs. Set batch_size to 1.
@gavrishp is the issue resolved now?
@jagadeeshi2i Had a query, is it by design we are selecting only the first element in the batch in the kserve envelopes?
https://github.com/pytorch/serve/blob/master/ts/torch_handler/request_envelope/kserve.py#L27 https://github.com/pytorch/serve/blob/master/ts/torch_handler/request_envelope/kservev2.py#L102,L111
This still breaks any use case with batch_size > 1.
The main feature of TorchServe is dynamic batching, especially if you have requests from multiple sources. It's a bummer that KServe doesn't support that.
🐛 Describe the bug
TorchServe supports batching of multiple requests, and the batch_size value is provided while registering the model.
The request envelope receives the input as a list of multiple request bodies, but the KServe v2 request envelope picks only the first item in the list of inputs: https://github.com/pytorch/serve/blob/master/ts/torch_handler/request_envelope/kservev2.py#L104
The result is a single output sent back as the response, causing the mismatch.
Error logs
TorchServe Error stdout MODEL_LOG - model: resnet50-3, number of batch response mismatched, expect: 5, got: 1.
Installation instructions
Followed instructions provided here - https://github.com/pytorch/serve/blob/master/kubernetes/kserve/kserve_wrapper/README.md
Model Packaging
Created a resnet50.mar using default parameters and handler
config.properties
inference_address=http://0.0.0.0:8085/
management_address=http://0.0.0.0:8085/
metrics_address=http://0.0.0.0:8082/
grpc_inference_port=7075
grpc_management_port=7076
enable_envvars_config=true
install_py_dep_per_model=true
enable_metrics_api=true
metrics_format=prometheus
NUM_WORKERS=1
number_of_netty_threads=4
job_queue_size=10
model_store=/mnt/models/model_store
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"resnet50": {"1.0": {"defaultVersion": true,"marName": "resnet50.mar","minWorkers": 6,"maxWorkers": 6,"batchSize": 16,"maxBatchDelay": 200,"responseTimeout": 2000}}}}
Versions
Name: kserve Version: 0.10.0
Name: torch Version: 1.13.1+cu117
Name: torchserve Version: 0.7.1
Repro instructions
Followed instructions provided here - https://github.com/pytorch/serve/blob/master/kubernetes/kserve/kserve_wrapper/README.md
Run the kserve_wrapper main.py and send multiple concurrent curl infer requests using the v2 protocol.
Command used - seq 1 10 | xargs -n1 -P 5 curl -H "Content-Type: application/json" --data @input_bytes.json http://0.0.0.0:8080/v2/models/resnet50/infer
Possible Solution
Changes are required to handle the TorchServe-batched inputs and generate an output for every request initiated by TorchServe.
Changes are needed in the parse_input() and format_output() methods in kservev2.py.
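A rough sketch of that direction, with simplified signatures assumed for illustration (this is not the actual kservev2.py implementation): keep one entry per TorchServe batch item in parse_input() and emit one response per batch item in format_output(), so the response count matches what TorchServe expects.

# Sketch only: handle every item TorchServe hands the envelope instead of data[0].
def parse_input(data):
    # `data` has one element per batched request, each carrying its own v2 body.
    parsed = []
    for request in data:
        body = request.get("body", request)
        parsed.append(body["inputs"])  # keep every request, not just the first
    return parsed

def format_output(batched_predictions, model_name="resnet50"):
    # One response dict per original request, so "expect: 5, got: 1" goes away.
    responses = []
    for preds in batched_predictions:
        responses.append({
            "model_name": model_name,
            "outputs": [
                {"name": "predict", "shape": [], "datatype": "BYTES", "data": [p]}
                for p in preds
            ],
        })
    return responses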