Just got a simple idea that could bring some speed up. Currently we start 2 transcription processes per GPU. I thought about using 3, but GPU memory usage is already high: I see 20GB used out of 24GB. Maybe we can use python threads to run 3 simultaneous transcriptions inside the same python process, reusing the single model already loaded in memory instead of loading one copy per process. GPU usage is already high, but maybe there is still room for some speed up. A rough sketch of the idea is below.
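Just to illustrate the shape of it: one process, the model loaded once on the GPU, and N worker threads pulling audio paths from a shared queue. faster-whisper is used here only as an example backend, and whether the backend we actually use is thread-safe for concurrent transcribe calls still needs to be checked.

```python
import queue
import threading
from faster_whisper import WhisperModel

NUM_WORKERS = 3  # simultaneous transcriptions per process (the "3" from the idea above)

# Loaded once and shared by all worker threads, instead of one copy per process.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

audio_queue = queue.Queue()
results = {}
results_lock = threading.Lock()

def worker():
    while True:
        path = audio_queue.get()
        if path is None:  # sentinel: shut this worker down
            break
        segments, _info = model.transcribe(path)
        text = " ".join(seg.text for seg in segments)
        with results_lock:
            results[path] = text

threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
```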
Running multiple transcriptions in inference batches is a common technique, but it would make the logic much more complex: we would have to group audios of similar duration, wait for enough of them to arrive, decide whether to group audios from the same client or from different ones, and decide how long to wait for more audios before closing a group... A rough sketch of that grouping logic is shown below.
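Something like this, just to make the trade-offs concrete. MAX_BATCH and MAX_WAIT are made-up values, not measured ones:

```python
import time

MAX_BATCH = 8    # audios per inference batch (assumption)
MAX_WAIT = 2.0   # seconds to wait for more audios before flushing a partial batch

pending = []     # list of (arrival_time, audio_path, duration_seconds)

def maybe_flush():
    """Return one batch of similar-duration audios, or None to keep waiting."""
    if not pending:
        return None
    full = len(pending) >= MAX_BATCH
    timed_out = time.time() - pending[0][0] >= MAX_WAIT
    if not (full or timed_out):
        return None
    pending.sort(key=lambda item: item[2])  # similar durations end up adjacent
    batch = pending[:MAX_BATCH]
    del pending[:MAX_BATCH]
    return [path for _, path, _ in batch]
```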
WhisperX uses batch inference (transcribing many parts of one audio at the same time) to speed up transcription by up to 10x on GPUs with this technique alone, something like the snippet below. I think it is possible to change the WhisperX library so it transcribes different audios at the same time using audio batches.
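For reference, this is roughly how WhisperX batches chunks of a single audio today (following its README); the change proposed here would be letting one batch mix chunks coming from different audios:

```python
import whisperx

device = "cuda"
model = whisperx.load_model("large-v2", device, compute_type="float16")

audio = whisperx.load_audio("audio.wav")
# batch_size controls how many ~30s chunks of this audio go through the model at once.
result = model.transcribe(audio, batch_size=16)
print(result["segments"])
```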
@hauck-jvsh, since you know the transcription code and have already contributed fixes and improvements to it, would you like to help improve the WhisperX library (I just forked it into the sepinf-inc repo) and the IPED code, so audios of similar sizes are grouped before being transcribed? I think that would allow us to update our transcription service algorithm without the new hardware, whose purchase should take longer after the latest government budget restrictions...
General recommendations: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html
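A few of the inference-oriented switches from that guide, just as a starting point; the actual gains depend on the model and GPU, so they would need to be benchmarked:

```python
import torch

torch.backends.cudnn.benchmark = True         # autotune conv kernels when input sizes are fixed
torch.backends.cuda.matmul.allow_tf32 = True  # allow TF32 matmuls on Ampere and newer GPUs
torch.backends.cudnn.allow_tf32 = True        # same for cuDNN convolutions
torch.set_float32_matmul_precision("high")    # equivalent knob in newer PyTorch versions
```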
This one was found by @hauck-jvsh: https://developer.nvidia.com/blog/accelerating-inference-up-to-6x-faster-in-pytorch-with-torch-tensorrt/