triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
https://triton-inference-server.github.io/pytriton/
Apache License 2.0

Example of TensorRT-LLM Whisper backend for PyTriton #65

Open aleksandr-smechov opened 6 months ago

aleksandr-smechov commented 6 months ago

Describe the solution you'd like With the recent TensorRT-LLM support for Whisper, and now that PyTriton supports TensorRT-LLM, it would be great to have examples of efficient client and server code, as well as decoupled mode examples.

Describe alternatives you've considered I've experimented with WhisperS2T coupled with FastAPI and PyTriton, and both perform well. It would be great to get a more involved example, like here and here.

piotrm-nvidia commented 6 months ago

Integrating support for Whisper using TensorRT-LLM with PyTriton for speech-to-text is an exciting challenge. However, PyTriton doesn't currently support streaming directly. For streaming audio, we might need a different setup, but PyTriton can still be a big part of the solution, especially for batch processing, which is more efficient than handling one piece at a time.

Since PyTriton excels at batch workloads such as text generation but is less suited to continuous streaming audio (for example, a live phone call), here's a simplified plan to make the most of PyTriton for speech-to-text processing:

  1. Create a Streaming Server: First, we need a server that collects audio from users. It waits until it has enough audio to make a full segment that's ready for processing.

  2. Use PyTriton for Processing: Once we have a complete audio segment, we send it over to PyTriton. Here, the Whisper model, accelerated with TensorRT-LLM, converts the speech into text (a rough server-side sketch follows after this list). The goal is to take advantage of TensorRT-LLM’s speed and efficiency.

  3. Get the Results Back to the User: After PyTriton finishes inference, it sends the text back to the streaming server, which then returns it to the user. We aim to keep this process quick so users aren’t left waiting.
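
To make step 2 concrete, here is a minimal sketch of how such a callable could be bound with PyTriton. The `transcribe_batch` helper, the tensor names, and the assumed audio format are all hypothetical placeholders for a TensorRT-LLM Whisper wrapper, not part of PyTriton or TensorRT-LLM:

```python
import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

# Hypothetical wrapper around a TensorRT-LLM Whisper runner; engine loading and
# generation details are out of scope for this sketch.
from my_whisper_trtllm import transcribe_batch  # assumption, not a real package


@batch
def _infer_fn(audio: np.ndarray):
    # `audio` arrives as a padded [batch, num_samples] float32 array of 16 kHz mono PCM.
    transcripts = transcribe_batch(audio)  # hypothetical: returns one string per batch item
    encoded = np.char.encode(np.array(transcripts)[:, None], "utf-8")
    return {"transcript": encoded}


with Triton() as triton:
    triton.bind(
        model_name="whisper-trtllm",
        infer_func=_infer_fn,
        inputs=[Tensor(name="audio", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="transcript", dtype=bytes, shape=(1,))],
        config=ModelConfig(max_batch_size=8),
    )
    triton.serve()  # blocks and serves the HTTP/gRPC endpoints
```

The streaming server from step 1 would then call this endpoint with complete audio segments.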

For this setup to work really well, especially in real-world scenarios, we might need to get a bit clever with how we handle the audio. Instead of sending everything we get straight away, implementing a system to detect when someone has stopped speaking can help us make sure we're only sending complete thoughts to PyTriton. This avoids cutting off sentences and improves accuracy.
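
As a rough illustration of that "stopped speaking" detection, a naive energy-based check over the buffered audio could look like the following. This is purely a sketch; a production setup would more likely use a dedicated VAD such as WebRTC VAD or Silero:

```python
import numpy as np


def is_end_of_utterance(pcm: np.ndarray, sample_rate: int = 16000,
                        trailing_silence_s: float = 0.8,
                        energy_threshold: float = 1e-4) -> bool:
    """Return True when the last `trailing_silence_s` seconds of audio are essentially silent."""
    tail_len = int(trailing_silence_s * sample_rate)
    if len(pcm) < tail_len:
        return False  # not enough audio buffered yet
    tail = pcm[-tail_len:].astype(np.float32)
    return float(np.mean(tail ** 2)) < energy_threshold
```

The streaming server would append incoming chunks to a buffer and flush the buffered segment to PyTriton whenever this returns True.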

To make this solution as good as it can be, knowing more about the kinds of conversations it'll be used for, the quality of audio we're expecting, and how people typically speak in these scenarios would be super helpful.

To refine the example further and align it with practical deployment scenarios, would you like to share more details on these aspects so the proposed solution can be tailored to your specific requirements?

aleksandr-smechov commented 6 months ago

Thanks for the detailed reply. It's definitely a fun challenge. I made an interim solution with our current API server, wordcab-transcribe, by integrating WhisperS2T as a Whisper "engine". I'd mostly be interested in a batch/async setup with dynamic batching: basically a server that can gracefully handle lots of incoming requests asynchronously. From what I understand, this is already possible if we're not considering streaming.
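
For that kind of batch/async setup, PyTriton's dynamic batcher on the server side plus its asyncio client on the caller side should cover it. A sketch, reusing the hypothetical model and tensor names from the earlier comment (the exact client API is worth double-checking against the PyTriton docs):

```python
import numpy as np

from pytriton.client import AsyncioModelClient
from pytriton.model_config import DynamicBatcher, ModelConfig

# Server side: let Triton coalesce concurrent requests into batches,
# waiting up to 10 ms to fill a batch before running inference.
config = ModelConfig(
    max_batch_size=16,
    batcher=DynamicBatcher(max_queue_delay_microseconds=10_000),
)


# Client side: many requests in flight from an asyncio application (e.g. a FastAPI handler).
async def transcribe(audio: np.ndarray) -> str:
    client = AsyncioModelClient("localhost", "whisper-trtllm")
    try:
        result = await client.infer_sample(audio=audio.astype(np.float32))
    finally:
        await client.close()
    return result["transcript"][0].decode("utf-8")
```

With this, concurrent callers each await their own request while Triton's dynamic batcher groups them into larger batches for the TensorRT-LLM engine.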

yuekaizhang commented 4 months ago

@aleksandr-smechov Have you tried this Triton python_backend + Whisper TensorRT-LLM recipe? https://github.com/k2-fsa/sherpa/tree/master/triton/whisper

If @piotrm-nvidia would like to accept a PR, I'd love to prepare a PyTriton Whisper TensorRT-LLM recipe under pytriton/example.

aleksandr-smechov commented 4 months ago

@yuekaizhang Hey Yuekai, definitely familiar with your work. Yes, I've tried this, but eventually integrated WhisperS2T's version into our current API server, wordcab-transcribe, since that's more flexible for us. It works well, but I'm still exploring the performance boost PyTriton could offer over our current solution. Would be awesome to see what you have in mind for the PR.

piotrm-nvidia commented 3 months ago

@yuekaizhang, thank you for your proposal.

While I recognize the potential value of integrating your Whisper example with the PyTriton repository, there are some complexities related to the additional dependencies and components it requires. For inclusion in the PyTriton examples, we strive for configurations that can be easily set up with minimal environment adjustments. A criterion for acceptance is the ability to install all dependencies via pip inside the nvcr.io/nvidia/pytorch (or a similar) image, without needing to build a new Docker image.

If your implementation aligns with these guidelines, I'd be more than happy to consider it. Here’s how we manage our testing for new examples:

  1. Each example should include a METADATA variable in test.py within its test suite, specifying the Docker image used during testing. You can see an example here: Link to METADATA example.

  2. Our CI executes a test.sh script for each example, which handles installation and environment setup via pip. The script must not require manual intervention or any steps that can’t be automated. Refer to our existing script structure here for guidance: Link to test.sh example.
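
For orientation only, a hypothetical skeleton of such a test.py might look like the following; the real METADATA schema and test flow should be taken from the linked examples, not from this sketch:

```python
# test.py -- hypothetical structure only; copy the real METADATA schema and
# test flow from the linked example in the repository.
METADATA = {
    "image_name": "nvcr.io/nvidia/pytorch:xx.xx-py3",  # Docker image used by CI (placeholder tag)
}


def main():
    # Start the example server, run the client against it, and assert on the response.
    ...


if __name__ == "__main__":
    main()
```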