Open MorrisMLZ opened 3 months ago
@MorrisMLZ - can you give more details / reference / examples to sparse tensors and what is required for support?
Is this a specific tensor format? Is it tensorflow specific?
@tanmayv25 - would our TensorFlow backend support sparse tensors?
Yes, all of our models are TF models, and most of the inputs of the TF models are TF sparse tensors. We're investigating using Triton to serve these TF models, and we want to know whether Triton's TF backend supports TF models with sparse tensors. If so, does it also support dynamic request batching for TF models with sparse tensors? Thanks!
Triton TF backend does not support SparseTensor yet. We can mark this as an enhancement ask.
@tanmayv25 Are we saying Triton in general does not support SparseTensors? What about pytorch SparseTensor - https://pytorch.org/docs/stable/sparse.html?
@tanmayv25 Also what about pytorch backend? Does it support Sparse Tensors? Or in general any of the backends support Sparse Tensors?
cc: @nnshah1 If you can help on above two questions
Triton copies the binary blob of data from the client to the backend, and the backend in turn communicates it to the DL framework (TF, PyTorch). Triton does not try to interpret or attach meaning to the data blob in any way. Hence, there are no restrictions on supporting SparseTensors in Triton itself. If one were to write a custom backend (or a model.py in the python backend), then that could consume the SparseTensor.
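To make the custom-backend workaround concrete: since Triton treats inputs as opaque blobs, one option (a sketch, not an official Triton recipe) is to ship a sparse tensor's COO components — indices, values, and dense shape — as three ordinary dense inputs, then reassemble the sparse object inside the backend. The decomposition itself is framework-agnostic; the function names here are hypothetical:

```python
import numpy as np

def to_coo_components(dense):
    """Split a dense array into COO components that can be sent
    as three ordinary dense Triton inputs."""
    indices = np.argwhere(dense != 0)      # (nnz, ndim) coordinates
    values = dense[dense != 0]             # (nnz,) nonzero values
    dense_shape = np.asarray(dense.shape)  # (ndim,) logical shape
    return indices, values, dense_shape

def from_coo_components(indices, values, dense_shape):
    """Rebuild the dense array. Inside a model.py this would instead
    construct e.g. tf.sparse.SparseTensor(indices, values, dense_shape)
    or torch.sparse_coo_tensor(indices.T, values, tuple(dense_shape))."""
    dense = np.zeros(dense_shape, dtype=values.dtype)
    dense[tuple(indices.T)] = values
    return dense
```

The client would send the three arrays; the python backend's model.py would read them back with its usual input APIs and hand the reconstructed sparse tensor to the framework.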
If the framework expects the input to be in a special format (read: object) other than its usual Tensor object, then neither the TF nor the PyTorch backend would support sparse tensors. From what I have found online, this appears to be the case.
Currently we haven't tried using sparse tensors in any of our backends. Hence, marked it as an enhancement.
Adding @harryskim for assessing the priority of this ticket.
I'm a SWE at LinkedIn ML infra. Our team is investigating whether we can adopt Triton Server for our GPU workloads. One question regarding Triton Server's dynamic batching capability: does the batching support inference requests with sparse tensors? I'm asking because TF Serving's request batching only supports dense tensors. If Triton supported sparse tensors, that would be a strong motivation for us to move to Triton.
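For context on why batching sparse requests is harder than batching dense ones: each request can carry a different number of nonzeros, so simple concatenation along a batch axis does not work. One common scheme (similar in spirit to how TF batches SparseTensors; this is a sketch of the general technique, not Triton behavior) is to prepend a batch index column to each request's COO indices:

```python
import numpy as np

def batch_coo_requests(requests):
    """Combine per-request COO tensors (indices, values, dense_shape)
    into a single batched COO tensor with a leading batch dimension.
    Requests may have different numbers of nonzeros (ragged nnz)."""
    all_indices, all_values = [], []
    # the batched logical shape covers the largest per-request shape
    max_shape = np.max([r[2] for r in requests], axis=0)
    for batch_id, (indices, values, dense_shape) in enumerate(requests):
        # prepend the batch index as an extra coordinate column
        batch_col = np.full((indices.shape[0], 1), batch_id, dtype=indices.dtype)
        all_indices.append(np.hstack([batch_col, indices]))
        all_values.append(values)
    batched_indices = np.vstack(all_indices)
    batched_values = np.concatenate(all_values)
    batched_shape = np.concatenate([[len(requests)], max_shape])
    return batched_indices, batched_values, batched_shape
```

A dense-only batcher (like TF Serving's, or Triton's dynamic batcher today) has no notion of this ragged-nnz handling, which is why sparse batching needs explicit support rather than falling out for free.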