Discussion with Fun:
They use Softcatala/whisper-ctranslate2, the CLI for faster-whisper. Whisper is invoked as a CLI by the PeerTube runner, available through a package.
Everything runs ONLY on CPU. It's not yet in production, still WIP. They use C2-30 machines from OVH (30 GB RAM, 8 vCores, 200 GB SSD).
A POC was deployed by the German team, accessible on a VM through SSH. It builds a transcription + summarization pipeline and exposes it through an API.
Live streaming example to be investigated.
Deploy a Speech-to-Text model
This work focuses on Whisper.
What's Whisper?
Whisper is a Transformer-based model developed by OpenAI, specializing in Speech-to-Text (STT) tasks, also known as Automatic Speech Recognition (ASR) (source). For more information, you can explore the official Whisper page by OpenAI.
The latest release, Whisper-large-v3 on Hugging Face, is the most advanced version provided by the OpenAI team.
Hugging Face is a popular platform for sharing and collaborating on AI models. It functions as a repository where developers and researchers can publish, discover, and utilize machine learning models.
How does it work?
It takes an audio source and processes the sound to identify and transcribe spoken words into text. The model was trained on either English-only data or multilingual data, and it is capable of automatic language detection to ensure accurate transcription across multiple languages.
In Python, you can load the model directly from HF using their Transformers library. This library leverages state-of-the-art deep learning frameworks to load models and perform inference efficiently. (It also supports training or fine-tuning pre-trained models, but that's out of scope.)
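As an illustration, here is a minimal sketch (not part of the PoC code) of loading whisper-large-v3 through the Transformers pipeline API and transcribing a local file; the file name and chunk length are placeholder assumptions:

```python
# Minimal sketch: load Whisper from the Hugging Face Hub and transcribe one file.
# Assumes `torch` and `transformers` are installed; "sample.wav" is a placeholder.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
)

# chunk_length_s splits long audio into 30 s windows, the length Whisper was trained on.
result = asr("sample.wav", chunk_length_s=30)
print(result["text"])
```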
PoC
I've wrapped a `whisper-large-v3` model in a FastAPI API and deployed it to the cloud. My POC has a single endpoint, `/transcribe`, which accepts a file. The file is copied to a temporary file in the container and then processed by the Whisper model; the model's output is returned to the client. I was forced to add a visual interface to post data to the `/transcribe` endpoint. Don't pay attention to the static files served by FastAPI, they are definitely dirty. It takes roughly 8 seconds with a small T4 GPU to process a 3-minute-long audio file.
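For reference, here is a minimal sketch of what such an endpoint can look like. This is not the actual PoC code; the model loading, file handling, and response shape are simplified assumptions.

```python
# Illustrative sketch of a /transcribe endpoint, not the PoC's actual implementation.
# Assumes `fastapi`, `python-multipart`, `torch`, and `transformers` are installed.
import os
import shutil
import tempfile

import torch
from fastapi import FastAPI, File, UploadFile
from transformers import pipeline

app = FastAPI()

# Load the model once at startup so each request only pays for inference.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)):
    # Copy the upload to a temporary file in the container, then run Whisper on it.
    suffix = os.path.splitext(file.filename or "")[1]
    with tempfile.NamedTemporaryFile(suffix=suffix) as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp.flush()
        result = asr(tmp.name, chunk_length_s=30)
    return {"text": result["text"]}
```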
The code is "quick-and-dirty", and the same goes for the commit history; everything is versioned in a repo hosted on HF here. I enable the T4 hardware only when I need it, to avoid paying too much.
Building the correct Docker image was quite challenging. The container needs the appropriate version of CUDA for the available hardware, along with the corresponding version of PyTorch. Consequently, PyTorch is installed directly from its wheel rather than through the `requirements.txt` file. I added more layers than necessary to my Docker image to cache installations and troubleshoot the steps that were breaking; however, having so many layers is not good practice.
My afternoon cost me only $0.13.
Hardware considerations
GPUs are the standard choice of hardware for machine learning because they are optimized for memory bandwidth and parallelism, unlike CPUs.
Running Transformer-based AI models on a CPU often results in suboptimal performance, particularly in terms of latency: CPUs are not well-suited for the highly parallel computations these models require, which leads to slower inference times.
In contrast, GPUs excel at parallelism, providing significant acceleration during model inference due to their ability to handle multiple operations simultaneously. While GPUs are available through most cloud providers (such as AWS, GCP, etc.), they can be expensive. Additionally, securing GPUs on smaller private cloud providers, like Outscale, can be more challenging.
For my research, I leveraged Hugging Face Spaces.
Hugging Face "Spaces" allows you to build and deploy custom Docker containers to serve a FastAPI server. FastAPI is a rapidly growing framework for developing lightweight, asynchronous Python APIs, and is particularly popular for deploying AI models via simple APIs.
On Hugging Face Spaces, you can access various GPU types and pay on an hourly basis (e.g., small T4 GPUs are approximately $0.40/hour). This flexibility is excellent for prototyping and deploying APIs in the short term.
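For context, a Docker-based Space simply runs whatever server the container starts and routes traffic to the port declared in the Space config (7860 by default). A minimal launcher sketch, assuming the FastAPI instance lives in `app.py`:

```python
# Minimal sketch: start the FastAPI app on the port a Docker Space exposes by default.
# "app:app" and port 7860 are assumptions, not the PoC's actual configuration.
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app:app", host="0.0.0.0", port=7860)
```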
Looking ahead, our goal is to deploy our AI services on Kubernetes (K8s). To achieve this, we will need to discuss setting up new K8s nodes on virtual machines equipped with sufficiently powerful GPUs.
(Scaling AI models on K8s can be challenging, we should not underestimate it).
Inference optimizations
Two well-known Whisper optimizations dominate the landscape: faster-whisper (the CTranslate2-based reimplementation mentioned above) and insanely-fast-whisper.
While working on my POC, I discovered recent work from the creator of insanely-fast-whisper, which offers new optimizations to try, especially distillation techniques. Stay tuned.
insanely-fast-whisper is based on Flash Attention 2, which optimizes how data is moved in memory during inference. Unfortunately, Flash Attention 2 is not implemented for all types of GPUs: it supports Ampere GPUs (e.g., the A100) but not Turing GPUs (e.g., the T4, the cheapest option).
My next step would be to deploy a distilled Whisper with the few optimizations recommended by Vaibhavs10 in his latest work.
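As a rough sketch of that direction (the checkpoint name and settings are assumptions based on the public distil-whisper models, not a validated setup): a distilled checkpoint loaded through the same Transformers pipeline, with fp16, chunked long-form decoding, batching, and optionally Flash Attention 2 on supported GPUs.

```python
# Sketch only: distilled Whisper with common inference optimizations.
# Flash Attention 2 requires the `flash-attn` package and an Ampere-class GPU;
# drop `model_kwargs` on Turing GPUs such as the T4.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",  # distilled checkpoint (assumption)
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

# Chunked long-form transcription with batched chunks.
result = asr("sample.wav", chunk_length_s=15, batch_size=8)
print(result["text"])
```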
Next steps
To-do: