triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Does Triton Server support Dynamic Request Batching for models that have sparse tensors as inputs? #7333

Open MorrisMLZ opened 3 months ago

MorrisMLZ commented 3 months ago

I'm a SWE on LinkedIn's ML infra team. Our team is investigating whether we can adopt Triton Server for our GPU serving. One question we have about Triton Server's dynamic batching capability: does the batching support inference requests with sparse tensors? I'm asking because TF Serving's request batching only supports dense tensors. If Triton could support sparse tensors, it would be a strong motivation for us to move to Triton.

nnshah1 commented 3 months ago

@MorrisMLZ - can you give more details / references / examples of the sparse tensors in question and what is required to support them?

Is this a specific tensor format? Is it TensorFlow-specific?

@tanmayv25 - would our TensorFlow backend support sparse tensors?

MorrisMLZ commented 3 months ago

Yes, all of our models are TF models, and most of the inputs to those models are TF sparse tensors. We're investigating using Triton to serve them, and we want to know: does Triton's TF backend support TF models with sparse tensors? If so, does it also support dynamic request batching for TF models with sparse tensors? Thanks!
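
For context, a `tf.sparse.SparseTensor` is three dense tensors under the hood (indices, values, dense_shape), which is why a serving layer that only understands dense tensors can still carry one if the model decomposes and reassembles it. A minimal sketch of that decomposition (standard TF API, nothing Triton-specific):

```python
import tensorflow as tf

# A tf.sparse.SparseTensor is three dense tensors under the hood:
#   indices:     int64, shape [nnz, rank] -- coordinates of non-zeros
#   values:      any dtype, shape [nnz]
#   dense_shape: int64, shape [rank]
st = tf.sparse.SparseTensor(
    indices=[[0, 1], [2, 3]],
    values=[10.0, 20.0],
    dense_shape=[3, 4],
)

# These three dense components are what would travel over the wire
# if the serving layer only understands dense tensors ...
indices, values, dense_shape = st.indices, st.values, st.dense_shape

# ... and the model can reassemble (and, if needed, densify) them.
rebuilt = tf.sparse.SparseTensor(indices, values, dense_shape)
dense = tf.sparse.to_dense(rebuilt)  # shape [3, 4]
```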

tanmayv25 commented 3 months ago

The Triton TF backend does not support SparseTensor yet. We can mark this as an enhancement request.

ndeep27 commented 3 months ago

@tanmayv25 Are we saying that Triton in general does not support sparse tensors? What about PyTorch's sparse tensors: https://pytorch.org/docs/stable/sparse.html?
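
For reference, a PyTorch COO sparse tensor is likewise assembled from dense index and value tensors (standard torch API, nothing Triton-specific), so the same decomposition idea applies; a minimal sketch:

```python
import torch

# COO layout: indices has shape [rank, nnz], values has shape [nnz].
indices = torch.tensor([[0, 2],    # row coordinates of non-zeros
                        [1, 3]])   # column coordinates
values = torch.tensor([10.0, 20.0])
st = torch.sparse_coo_tensor(indices, values, size=(3, 4))

dense = st.to_dense()  # [3, 4] dense tensor with two non-zero entries
```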

ndeep27 commented 3 months ago

@tanmayv25 Also, what about the PyTorch backend? Does it support sparse tensors? Or, more generally, do any of the backends support sparse tensors?

ndeep27 commented 3 months ago

cc: @nnshah1 Could you help with the two questions above?

tanmayv25 commented 3 months ago

Triton copies the binary blob of data from the client to the backend, and the backend in turn hands it to the DL framework (TF, PyTorch). Triton does not try to interpret or attach meaning to the data blob in any way, so there are no restrictions on Triton's side against supporting SparseTensors. If one were to write a custom backend (or a model.py in the Python backend), it could consume the SparseTensor (a minimal sketch of such a model.py appears below).

However, if the framework expects the tensor to arrive as a special object rather than its usual dense Tensor object, then neither the TF nor the PyTorch backend would support sparse tensors out of the box; from what I have found online, this appears to be the case.

We haven't yet tried using sparse tensors in any of our backends, so I have marked this as an enhancement.
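
As an illustration of the model.py route mentioned above, here is a minimal, untested sketch that reassembles a TF SparseTensor from three dense inputs. The input/output names are assumptions for illustration, not an established Triton convention, and it assumes TensorFlow is installed in the Python backend's environment:

```python
# model.py for Triton's Python backend (sketch, not an official example).
# Assumes the model config declares three dense inputs named
# SPARSE_INDICES, SPARSE_VALUES, SPARSE_SHAPE and one dense output OUTPUT.
import tensorflow as tf
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            indices = pb_utils.get_input_tensor_by_name(
                request, "SPARSE_INDICES").as_numpy()
            values = pb_utils.get_input_tensor_by_name(
                request, "SPARSE_VALUES").as_numpy()
            shape = pb_utils.get_input_tensor_by_name(
                request, "SPARSE_SHAPE").as_numpy()

            # Rebuild the sparse tensor and run whatever TF computation
            # the model needs; densifying here is just a placeholder.
            st = tf.sparse.SparseTensor(indices, values, shape)
            out = tf.sparse.to_dense(st).numpy()

            responses.append(pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("OUTPUT", out)]))
        return responses
```

One caveat: because the number of non-zeros varies per request, the component tensors have different first-dimension sizes, so plain dynamic batching cannot concatenate them along the batch axis; Triton's ragged-batching option (`allow_ragged_batch`) targets variable-shape inputs like this, though whether it covers this case is untested here.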

Adding @harryskim to assess the priority of this ticket.
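
For completeness, the client side of such a workaround only ever sends dense tensors, so the standard tritonclient API suffices. A minimal sketch, assuming the hypothetical model and tensor names from the model.py sketch above:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# The three dense components of one sparse input.
indices = np.array([[0, 1], [2, 3]], dtype=np.int64)
values = np.array([10.0, 20.0], dtype=np.float32)
shape = np.array([3, 4], dtype=np.int64)

inputs = []
for name, arr, dtype in [
    ("SPARSE_INDICES", indices, "INT64"),
    ("SPARSE_VALUES", values, "FP32"),
    ("SPARSE_SHAPE", shape, "INT64"),
]:
    t = httpclient.InferInput(name, list(arr.shape), dtype)
    t.set_data_from_numpy(arr)
    inputs.append(t)

# "sparse_model" is a placeholder model name.
result = client.infer("sparse_model", inputs)
print(result.as_numpy("OUTPUT"))
```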