AwokeKnowing opened this issue 1 year ago
Hey @AwokeKnowing! Just to clarify, for the streaming application you have in mind, do you need the model to run online (i.e. start transcribing while the audio input is still being recorded), or just need a microphone demo of the model that can be run offline (wait till all the audio is recorded, then forward to the model)?
@sanchit-gandhi well, the reality is people need it to run 'online', and latency in getting the text to an LLM is important, but I have not seen a standard, effective solution for that (detecting end of utterance). So I strongly suggest you at least do a simple '1 second of silence' implementation, which will appease the masses. Or possibly a 'hold the spacebar down while speaking' implementation.
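To be concrete, the '1 second of silence' idea doesn't need much code. Here's a minimal sketch (not from this repo; the RMS threshold and the chunk iterator are assumptions you'd tune for your setup):

import numpy as np

def record_until_silence(chunks, sampling_rate=16000, silence_s=1.0, rms_threshold=0.01):
    """`chunks` is any iterator yielding float32 numpy arrays of raw mic audio (hypothetical source)."""
    audio, quiet_time = [], 0.0
    for chunk in chunks:
        audio.append(chunk)
        rms = np.sqrt(np.mean(chunk ** 2))
        # Reset the silence timer whenever the chunk is louder than the (assumed) threshold
        quiet_time = 0.0 if rms >= rms_threshold else quiet_time + len(chunk) / sampling_rate
        if quiet_time >= silence_s:
            break
    return np.concatenate(audio)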
Collabora has a live example using a websocket, along with a client, browser extensions, etc. I would love to see this adapted to work with Distil-Whisper. They are currently using faster-whisper, so it would possibly need to be adapted. Right now it basically has a hard limit of 1 to 2 seconds, which doesn't make it useful for real-time, word-by-word transcription.
If this setup could transcribe single words like "hi" or "hello" in real time, it would definitely beat that implementation using faster-whisper. I think Silero VAD can handle chunks of 0.3 to 0.6 seconds, so if this can handle 0.1 seconds it would work well, I imagine.
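If anyone wants to try the VAD route, Silero VAD is loadable via torch.hub. A rough, untested sketch for gating chunks before they hit the ASR model (the 16 kHz mono float32 chunk format is an assumption):

import torch

# Silero VAD ships a torch.hub entry point; `utils` bundles its helper functions
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]

def contains_speech(chunk, sampling_rate=16000):
    """Return True if Silero VAD detects any speech in a float32 numpy chunk (assumed 16 kHz mono)."""
    timestamps = get_speech_timestamps(torch.from_numpy(chunk), vad_model, sampling_rate=sampling_rate)
    return len(timestamps) > 0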
Also, getting this set up to work with transformers.js would be pretty cool. You might already be aware, but transformers.js already works with Whisper, and it runs in Node.js and in the browser, which is pretty useful.
Well, I don't think this repo needs to solve the mic-to-utterance latency problem. I hope a few general solutions emerge which can be plugged into the STT models (AI-predicted end of utterance, as part of the HF pipeline?).
I just checked the example you gave and it is very good. But for now, I think there should be an example where you are talking and the live transcription is showing, as this is the best way to evaluate the model in practice. It would be nice to have both a Python and a JS version. I'm not sure if Collabora is set up to be easily adapted, but yeah, that's exactly what is needed.
When people see a model like this, they want to clone it, run a script and start talking. If that is there, it's likely to be a go-to repo.
From the X/Twitter post I read, I think they said 0.1 s, so this would already solve it, right? Because currently with faster-whisper anything below 1 second degrades accuracy.
If you converted the model to CTranslate2 then you should be able to use that solution, although I haven't tried it yet and @sanchit-gandhi would know better than I.
I'd be happy to implement it with the collabora code though if you believe it will work.
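For reference, here's roughly what that CTranslate2 route could look like, assuming the stock ct2-transformers-converter accepts the Distil-Whisper checkpoint (I haven't verified this end to end, so treat the flags and paths as a sketch):

# Convert the checkpoint with CTranslate2's converter (shell command, shown here as a comment):
#   ct2-transformers-converter --model distil-whisper/distil-medium.en \
#       --output_dir distil-medium.en-ct2 --copy_files tokenizer.json --quantization float16
from faster_whisper import WhisperModel

# Load the converted directory with faster-whisper and transcribe a local file
model = WhisperModel("distil-medium.en-ct2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")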
yeah, that would be perfect. A huge boost for this repo if people can clone, run a setup, and start talking and seeing how it works with just webcam audio.
@AwokeKnowing - We also have a "nearly live" implementation that uses WebRTC from a browser to stream to our Willow Inference Server which uses ctranslate2 and a variety of other performance optimizations we have implemented. It's unique in a variety of ways:
Once the connection is open and ICE, etc. has been negotiated, we pause the audio track until you start recording. This allows for long-running sessions over days or longer that don't consume any bandwidth beyond the bare-minimum exchanges necessary to keep the session active in the browser, any NAT devices, and the inference server. We have users that have left their sessions open for weeks.
When you start recording we resume the track and begin streaming audio. When you stop recording the WebRTC audio track is paused again and the received audio server side is passed from buffer to whisper.
The results are sent back via the WebRTC datachannel and displayed in the browser.
You can watch the (old) demo video, which is actually slow compared to what we currently have implemented, and it will only get faster with distil-whisper.
We just fixed some bugs in non-Chrome browsers that we haven't deployed yet, but if you have Chrome and want to try it, check it out.
Distil-whisper coming as soon as we get our hands on it!
Okay, here's a quick PoC. It works best if you have a GPU locally for faster inference. Also make sure you have FFmpeg installed so that the pipeline can record from your device mic. Over the next few days, I'll make a Gradio demo out of this to showcase it end-to-end.
import sys

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone_live

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-medium.en"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

transcriber = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)


def transcribe(chunk_length_s=20.0, stream_chunk_s=1.0):
    sampling_rate = transcriber.feature_extractor.sampling_rate

    # Stream raw audio from the mic via ffmpeg, yielding a new chunk every `stream_chunk_s` seconds
    mic = ffmpeg_microphone_live(
        sampling_rate=sampling_rate,
        chunk_length_s=chunk_length_s,
        stream_chunk_s=stream_chunk_s,
    )

    print("Start speaking...")
    for item in transcriber(mic, generate_kwargs={"max_new_tokens": 128}):
        sys.stdout.write("\033[K")  # clear the line before printing the updated transcription
        print(item["text"], end="\r")
        if not item["partial"][0]:
            break

    return item["text"]


transcribe()
You can adjust the chunk length based on how real-time you need the transcription to be. Using a smaller stream_chunk_s lends itself to more real-time speech recognition, since we divide our input audio into smaller chunks and transcribe them on the fly. However, this comes at the expense of poorer accuracy, since there's less context for the model to infer from.
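For example (values purely illustrative), a lower-latency setup would just pass a smaller stream_chunk_s to the function above:

# Lower latency, but each partial update sees less audio context
transcribe(chunk_length_s=10.0, stream_chunk_s=0.5)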
Note also that the function ffmpeg_microphone_live has been reported to have some bugs on Windows/Mac. If it's not registering the microphone, there are some monkey-patch solutions here: https://github.com/huggingface/transformers/issues/25183#issuecomment-1769312607
If you're running on CPU only, we might need a different implementation for this to work fast enough (e.g. in Rust).
@sanchit-gandhi How do you suggest a Rust port is going to solve the inference speed issue on CPU-only systems? I assumed the bottleneck is data processing through the model, which a Rust port won't fix. Or perhaps you're suggesting utilizing multiple cores for some sort of parallelization? At that point, ctranslate2 should be sufficient?
distil-medium.en works extremely well, in particular with whisper.cpp: https://github.com/huggingface/distil-whisper#exporting-to-other-libraries. In a toy benchmark, it's about 4x faster than large-v2. This is the best-performing CPU-only export I've come across for Distil-Whisper so far!
if it helps, here is a quick FastAPI implementation I just did
from io import BytesIO

import librosa
import soundfile as sf
import torch
from fastapi import FastAPI, File, HTTPException, UploadFile
from pydantic import BaseModel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

app = FastAPI()


class TranscriptionResponse(BaseModel):
    text: str


# Setup CUDA device and data type
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model and processor
model_id = "distil-whisper/distil-small.en"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Define ASR pipelines for short and long-form processing
short_form_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

long_form_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)


@app.post("/transcribe", response_model=TranscriptionResponse)
async def transcribe_audio(file: UploadFile = File(...)):
    try:
        # Read and preprocess audio
        audio_bytes = await file.read()
        buffer = BytesIO(audio_bytes)
        buffer.seek(0)
        audio_input, sr_original = sf.read(buffer, dtype='float32')
        if sr_original != 16000:
            audio_input = librosa.resample(audio_input, orig_sr=sr_original, target_sr=16000)

        duration = len(audio_input) / 16000  # Duration in seconds
        if duration <= 30:
            print("Using short-form processing")
            results = short_form_pipe(audio_input)
        else:
            print("Using long-form processing")
            results = long_form_pipe(audio_input)

        transcription = results['text'] if 'text' in results else "No transcription found"
        return TranscriptionResponse(text=transcription)
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        raise HTTPException(status_code=500, detail=str(e))


if __name__ == "__main__":
    import uvicorn

    uvicorn.run("app:app", host="0.0.0.0", port=8000, workers=4)
and tested it with this
import requests


def upload_file(file_path):
    url = 'http://localhost:8000/transcribe'
    with open(file_path, 'rb') as f:
        files = {'file': (file_path, f)}
        response = requests.post(url, files=files)
    return response.text


file_path = '/path/to/file.mp3'
response = upload_file(file_path)
print(response)
Let's face it: these models are developed with static datasets, but a primary use case is streaming audio transcription.
Please include a microphone-based demo (or suffer 1000 GitHub issues begging for it; see other Whisper repos).