Closed: yuzisun closed this issue 8 months ago
@msaroufim Can we set up a call with you to discuss this?
Hi @yuzisun, sure! I'd be happy to. Just forwarded this to the core team, lemme get back to you with a few times that work.
Might be easiest to email me, it's marksaroufim@meta.com
Quick suggestion: during my experiments I noticed that when TorchServe is used through the KServe v1/v2 REST protocol, we cannot use the dynamic batching done by the Netty server. This also causes a performance gap compared to sending raw inputs directly. Batching support would be ideal. Thanks!
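(For context, a minimal sketch of how the Netty frontend's dynamic batching is normally enabled when a model is registered through the management API; the archive name, port, and values below are placeholders, not taken from this thread.)

```python
import requests

# Register a model with frontend (Netty) dynamic batching enabled.
# "batch_size" and "max_batch_delay" are standard TorchServe
# registration parameters; "mnist.mar" and the values are placeholders.
resp = requests.post(
    "http://localhost:8081/models",          # default management port
    params={
        "url": "mnist.mar",
        "initial_workers": 1,
        "batch_size": 8,         # max requests aggregated into one batch
        "max_batch_delay": 50,   # ms to wait before sending a partial batch
    },
)
print(resp.status_code, resp.text)
```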
Makes sense; the goal of this issue is to remove the KServe wrapper entirely and implement OIP natively in TorchServe. @gavrishp I think it is possible to just disable the KServe wrapper and send requests directly to the Netty server using the KServe v1/v2 REST protocol, can you try that?
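(A rough sketch of what such a direct request to the Netty frontend could look like, assuming the default inference port 8080 and a placeholder model name; whether the frontend accepts the /v2 route without the KServe wrapper is exactly what this experiment would verify.)

```python
import requests

# KServe v2 / OIP-style inference request sent straight to the
# TorchServe frontend, bypassing the KServe Python wrapper.
# Model name, shape, and data are placeholders for illustration.
payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4],
        }
    ]
}
resp = requests.post(
    "http://localhost:8080/v2/models/mnist/infer",
    json=payload,
)
print(resp.status_code, resp.json())
```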
The feature
KServe has now rebranded the v2 inference protocol as the Open Inference Protocol (OIP) specification. Can we implement OIP natively in TorchServe, like other model servers such as Triton, MLServer, OpenVINO Model Server, and the AMD Inference Server?
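(For reference, these are the core REST routes the Open Inference Protocol defines, which a native implementation in the frontend would need to expose; the base URL and model name below are placeholders.)

```python
import requests

BASE = "http://localhost:8080"   # assumed TorchServe inference port
MODEL = "mnist"                   # placeholder model name

# Core OIP / KServe v2 REST routes:
requests.get(f"{BASE}/v2")                        # server metadata
requests.get(f"{BASE}/v2/health/live")            # server liveness
requests.get(f"{BASE}/v2/health/ready")           # server readiness
requests.get(f"{BASE}/v2/models/{MODEL}")         # model metadata
requests.get(f"{BASE}/v2/models/{MODEL}/ready")   # model readiness
requests.post(f"{BASE}/v2/models/{MODEL}/infer",  # inference
              json={"inputs": []})
```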
Motivation, pitch
Currently TorchServe places the KServe Python server in front of the TorchServe Netty server to adapt to the KServe v1/v2 REST protocol. However, this extra layer provides minimal value and causes numerous maintenance and performance issues. The KServe Python SDK is primarily designed for native Python inference runtimes where users want to implement arbitrary inference code with pre/post processing; TorchServe already provides comparable custom handlers. Therefore, there is no good reason to keep both and route every KServe inference request through
kserve python server -> Netty -> torchserve python worker.
Alternatives
No response
Additional context