triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
https://triton-inference-server.github.io/pytriton/
Apache License 2.0

How to infer with sequences? #50

Open · monsterlyg opened this issue 6 months ago

monsterlyg commented 6 months ago

It seems that ModelClient does not support sequence inference: "sequence_start", "sequence_id", and "sequence_end" cannot be found in infer_sample/infer_batch.

piotrm-nvidia commented 6 months ago

Thank you for your interest in PyTriton and stateful models. I have experimented with sequence support in the client, but I realized that PyTriton does not pass sequence information to the model. This is because PyTriton is designed for stateless models that do not need sequence parameters.

Triton has a stateful backend that can handle sequences for some models, but PyTriton allows you to bind any Python function to Triton. You can implement any state and logic you want for your model in Python. You can also store the state of your model in script variables.

This may seem like a workaround, but for some solutions it offers more flexibility to use a simple binding and manage the stateful logic in Python.
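For illustration, a minimal sketch of this pattern (the names accumulate_fn, SEQ_ID, CHUNK, and TOTAL are made up for the example, not part of any PyTriton API): the sequence id travels as an ordinary input tensor, and per-sequence state is kept in a module-level dict on the Python side:

```python
import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

# Per-sequence state lives in an ordinary module-level dict; note this is
# per-process state, so it assumes all requests for a given sequence reach
# the same Python worker.
_state = {}


@batch
def accumulate_fn(SEQ_ID, CHUNK):
    # With @batch, SEQ_ID arrives as (batch, 1) int64 and CHUNK as
    # (batch, n) float32 numpy arrays.
    totals = []
    for seq_id, chunk in zip(SEQ_ID, CHUNK):
        key = int(seq_id[0])
        running = _state.get(key, 0.0) + float(chunk.sum())
        _state[key] = running  # update the state for this sequence
        totals.append([running])
    return {"TOTAL": np.array(totals, dtype=np.float32)}


with Triton() as triton:
    triton.bind(
        model_name="StatefulExample",
        infer_func=accumulate_fn,
        inputs=[
            Tensor(name="SEQ_ID", dtype=np.int64, shape=(1,)),
            Tensor(name="CHUNK", dtype=np.float32, shape=(-1,)),
        ],
        outputs=[Tensor(name="TOTAL", dtype=np.float32, shape=(1,))],
        config=ModelConfig(max_batch_size=8),
    )
    triton.serve()
```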

I would appreciate your feedback on this approach.

Can you tell me why your model requires a state?

Slyne commented 3 months ago

> Can you tell me why your model requires a state?

I can give a simple example: streaming ASR, where inference runs over continuous audio chunks and responses are streamed back continuously. For the workaround you mentioned above, I understand we can store the sequence_id in our own code to maintain the state. My question is: will a sequence of requests from the same client be routed to the same compute instance, the way the sequence batcher in Triton Inference Server does it?
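A hedged sketch of the client side under the same assumptions as the server sketch above (model name StatefulExample and input names SEQ_ID/CHUNK are hypothetical): since infer_sample/infer_batch expose no sequence parameters, the id is sent as a normal input tensor:

```python
import numpy as np

from pytriton.client import ModelClient

with ModelClient("localhost", "StatefulExample") as client:
    seq_id = np.array([17], dtype=np.int64)  # made-up sequence id
    for _ in range(3):
        chunk = np.ones(160, dtype=np.float32)  # stand-in for an audio chunk
        result = client.infer_sample(SEQ_ID=seq_id, CHUNK=chunk)
        print(result["TOTAL"])
```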

monsterlyg commented 3 months ago

I am not actually using a stateful model; the question just came up while I was reading the source code. Thank you for the reply.
