triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
https://triton-inference-server.github.io/pytriton/
Apache License 2.0

Questions about new feature at 0.5.0 : decoupled model #58

Closed lionsheep0724 closed 7 months ago

lionsheep0724 commented 8 months ago

I have some questions about the decoupled model support in 0.5.0. Its documentation says it is specifically useful for Automatic Speech Recognition (ASR), but I don't understand why. Here are my questions.

  1. In my case, real-time audio packets are transmitted to a FastAPI server (over an HTTP keep-alive connection), and the packets are converted to audio features once enough audio has accumulated for ASR (maybe a few seconds of packets or more). The FastAPI server then sends a request to PyTriton with the extracted features and gets back the ASR response (text).
  2. In my scenario, how can I implement decoupled models, and what is the advantage? I wonder whether it guarantees that inference stays ordered with respect to each audio source. I guess the audio packet source and the FastAPI server (which buffers packets and extracts features from them) should be 1:1, and the PyTriton server to FastAPI servers should be 1:N, so that multiple audio sources can be handled without being mixed.
  3. Can PyTriton with a decoupled model handle streaming data? I.e., can we feed audio packets (bytes) to the server directly?
  4. How can we control the response length? (The doc says the server delivers a response whenever it deems fit.)
  5. How can we control parallelism (number of workers, etc.)? Referring to the doc: "It can receive many requests in parallel and perform inference on each request independently."

lionsheep0724 commented 8 months ago

Hi, do you have any news on this?

piotrm-nvidia commented 8 months ago

Triton Inference Server offers robust features for handling inference requests, and while it excels in certain areas, there are nuances to consider when dealing with specific use cases like streaming data. Let's clarify these aspects:

  1. Streaming Data vs. Batching:

    • Triton is designed to optimize GPU utilization primarily through batching. This means it's highly efficient when you have numerous small inference requests that can be batched together. However, for continuous, real-time data streams (like audio packets in ASR), the architecture of Triton might not be the most efficient out-of-the-box. Each request is treated as separate and independent, which might not align perfectly with the sequential and dependent nature of streaming audio data.
  2. ASR Specifics:

    • For ASR models, the integration might not be straightforward because the business logic for handling streaming data, especially the batching part, tends to be application-specific. For instance, batching one second of audio from multiple input streams and processing it together may require custom logic such as silence detection. This has to be handled at the application level; Triton cannot do it for you.
  3. Decoupled Models and Streaming Outputs:

    • While Triton doesn’t handle streaming input data inherently, its decoupled models can stream output data. As soon as the server has any part of the response ready, it can send it back, enabling real-time interactions such as LLMs producing text outputs progressively (a minimal server-side sketch follows this list).
  4. Sequence Handling:

    • Triton itself offers a sequence ID feature, but PyTriton does not expose it. Implementing logic that merges several inference requests into the single, ordered stream of audio features that ASR needs, based on sequence IDs, is hard. It is better to avoid this unless there is a clear benefit in GPU utilization or other performance metrics.
  5. Architecture Considerations:

    • The 1:N architecture (one audio source per FastAPI instance, multiple such instances interfacing with a single Triton server) can be beneficial. It is particularly advantageous when you need to batch many small inference requests to utilize the GPU effectively. If your application doesn't fit this pattern – for example, if you have fewer, larger inference requests – you might not see the same level of benefit. In such cases, interfacing with the model directly in the FastAPI server, or running multiple instances of the model, might be more efficient.
  6. Internal Batching in Decoupled Models:

    • Although Triton doesn’t batch requests for decoupled models internally, it allows each request to call the model’s inference function independently. This opens up the possibility of implementing custom internal batching within your model's logic, tailored to your specific needs.
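
To make the decoupled-output point more concrete, here is a minimal server-side sketch of how an ASR-style model could stream partial results with PyTriton. It loosely follows the decoupled-model example in the PyTriton documentation; the model name `streaming_asr`, the input/output names, and the fake partial transcripts are placeholders, and details such as `ModelConfig(decoupled=True)`, generator inference callables, and the tensor dtypes should be verified against your installed PyTriton version.

```python
import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def infer_fn(audio_features):
    # Placeholder for incremental ASR decoding: yield a few partial
    # transcripts, one per decoding step. A real model would decode the
    # features chunk by chunk and emit text as soon as it is available.
    batch_size = audio_features.shape[0]
    for step in range(3):
        partial = np.char.encode(
            np.array([[f"partial transcript {step}"]] * batch_size), "utf-8"
        )
        # Each yield becomes a separate response streamed back to the client.
        yield {"transcript": partial}


with Triton() as triton:
    triton.bind(
        model_name="streaming_asr",  # hypothetical model name
        infer_func=infer_fn,
        inputs=[Tensor(name="audio_features", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="transcript", dtype=bytes, shape=(1,))],
        config=ModelConfig(decoupled=True),  # enable decoupled (streaming) responses
    )
    triton.serve()
```

Because each request invokes the callable independently in decoupled mode, this callable is also the place where the custom internal batching mentioned in point 6 would live.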

In conclusion, while Triton Inference Server provides robust features for batch processing and can handle decoupled model outputs effectively, integrating it with streaming data sources like ASR might require careful consideration and potentially custom application logic. It excels in scenarios with numerous small inference requests but might not be the most efficient for continuous, real-time data streams. As always, the best approach depends on the specific requirements and constraints of your use case.
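
For completeness, here is a sketch of how the FastAPI side could consume such a stream of partial responses. It assumes the decoupled client available in recent PyTriton releases (`DecoupledModelClient`, which as far as I know works over gRPC only) and the placeholder model from the server sketch above; check the client API against your version before relying on it.

```python
import numpy as np

from pytriton.client import DecoupledModelClient

# Dummy feature batch standing in for the features the FastAPI server
# extracts from the buffered audio packets.
features = np.random.rand(1, 8000).astype(np.float32)

# Decoupled responses are streamed, so the client connects over gRPC.
with DecoupledModelClient("grpc://localhost:8001", "streaming_asr") as client:
    # infer_batch returns an iterator; each item is one partial response
    # delivered as soon as the server produced it.
    for partial in client.infer_batch(audio_features=features):
        print(partial["transcript"])
```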

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 7 months ago

This issue was closed because it has been stalled for 7 days with no activity.