bezaleel22 closed this issue 3 months ago.
Sorry for my late reply. Since I’ve been busy recently, I might be unable to add support for Whisper Models, but any PRs are welcome.
I'm also looking to use Whisper with CTranslate2 in Rust; it should run 2-3x faster than whisper.cpp. Are the bindings ready, or do they still need to be written? That's why I asked about integrating bindgen, so we can easily access all of CTranslate2's functions.
Perhaps I can add `StorageView` and `models.Whisper`, but I'm not sure if that's enough. This example uses `transformers` as the preprocessor, which is not available in Rust. Do you happen to know of an alternative preprocessor in Rust?
> Do you happen to know of an alternative preprocessor in Rust?
I'm not aware of one, but perhaps tch-rs might be capable of doing it.
From the docs, we need to compute a Mel spectrogram of the input audio. I'm still looking for an appropriate way to do so, but the rest of the steps are implemented in the `whisper` branch (#68), and you can test it.
Here is some sample code. Since we don't have the preprocessor in Rust yet, we first need to run this Python code:
```python
import librosa
import numpy as np
import transformers

# Load and resample the audio file.
audio, _ = librosa.load("audio.wav", sr=16000, mono=True)

# Compute the features of the first 30 seconds of audio.
processor = transformers.WhisperProcessor.from_pretrained("openai/whisper-tiny")
inputs = processor(audio, return_tensors="np", sampling_rate=16000)

# Save the features.
np.save("features.npy", inputs.input_features)
```
The above code reads audio.wav and saves its preprocessed features into features.npy.
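If you want to avoid the `transformers` dependency, the same log-Mel features can be approximated in plain NumPy. This is only a sketch: Whisper itself uses librosa's slaney-normalized mel filters, while this version uses the simpler HTK mel scale, so the values will differ slightly (the function names here are my own, not part of any library).

```python
import numpy as np

SAMPLE_RATE = 16000
N_FFT = 400
HOP = 160
N_MELS = 80
CHUNK = 30 * SAMPLE_RATE  # Whisper always sees 30-second windows

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters on the HTK mel scale. Note: Whisper itself uses
    # librosa's slaney-normalized filters, so values differ slightly.
    def hz2mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    freqs = mel2hz(np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2))
    bins = np.fft.rfftfreq(n_fft, 1.0 / sr)  # 201 frequency bins
    fb = np.zeros((n_mels, len(bins)))
    for i in range(n_mels):
        left, center, right = freqs[i], freqs[i + 1], freqs[i + 2]
        up = (bins - left) / (center - left)
        down = (right - bins) / (right - center)
        fb[i] = np.clip(np.minimum(up, down), 0.0, None)
    return fb

def log_mel_spectrogram(audio):
    # Pad or trim to exactly 30 seconds, then reflect-pad for centered frames.
    audio = np.asarray(audio, dtype=np.float64)[:CHUNK]
    audio = np.pad(audio, (0, CHUNK - len(audio)))
    audio = np.pad(audio, (N_FFT // 2, N_FFT // 2), mode="reflect")
    window = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window
    frames = np.lib.stride_tricks.sliding_window_view(audio, N_FFT)[::HOP]
    magnitudes = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    mel = mel_filterbank(SAMPLE_RATE, N_FFT, N_MELS) @ magnitudes.T
    mel = mel[:, :-1]  # drop the last frame, as Whisper does -> (80, 3000)
    log_spec = np.log10(np.maximum(mel, 1e-10))
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)  # clamp dynamic range
    return (log_spec + 4.0) / 4.0
```

The output shape matches what the model expects for one window: `(80, 3000)`, i.e. 80 mel bins by 3000 frames.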
Then, this code generates the text from the features:
```rust
use anyhow::Result;
use ndarray::Array3;

use ct2rs::{auto, Tokenizer};
use ct2rs::storage_view::StorageView;
use ct2rs::whisper::Whisper;

fn main() -> Result<()> {
    // Read the features from the file.
    let mut array: Array3<f32> = ndarray_npy::read_npy("features.npy")?;
    let shape = array.shape().to_vec();
    let v = StorageView::new(&shape, array.as_slice_mut().unwrap(), Default::default())?;

    let model = Whisper::new("whisper-tiny-ct2", Default::default())?;
    let tokenizer = auto::Tokenizer::new("whisper-tiny-ct2")?;

    // Detect the spoken language from the features.
    let lang = model.detect_language(&v)?;
    println!("Detected language: {:?}", lang[0][0]);

    // Generate the transcription, prompted with the detected language.
    let res = model.generate(
        &v,
        &vec![vec![
            "<|startoftranscript|>",
            &lang[0][0].language,
            "<|transcribe|>",
            "<|notimestamps|>",
        ]],
        &Default::default(),
    )?;
    for v in res[0].sequences.iter() {
        println!("{:?}", tokenizer.decode(v.clone()));
    }
    Ok(())
}
```
The model file is converted with this command:

```
ct2-transformers-converter --model openai/whisper-tiny --output_dir whisper-tiny-ct2 --copy_files preprocessor_config.json
```
I tried it and got this error:
```
(venv) ➜ sample git:(main) ls whisper-tiny-ct2
config.json model.bin preprocessor_config.json vocabulary.json
(venv) ➜ sample git:(main) cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/sample`
Error: failed to create a tokenizer
(venv) ➜ sample git:(main)
```
I forgot: you also need to download `tokenizer.json` from https://huggingface.co/openai/whisper-tiny/tree/main and put it in the same directory as `model.bin`.
Added an example that doesn't require Python code.
https://github.com/jkawamoto/ctranslate2-rs/blob/whisper/examples/whisper.rs
Working on Windows!
```
(venv) PS C:\Users\User\Documents\code\ctranslate2-rs> cargo run --example whisper whisper-tiny-ct2 .\multi.wav
warning: `C:\Users\User\.cargo\config` is deprecated in favor of `config.toml`
note: if you need to support cargo 1.38 or earlier, you can symlink `config` to `config.toml`
warning: `C:\Users\User\.cargo\config` is deprecated in favor of `config.toml`
note: if you need to support cargo 1.38 or earlier, you can symlink `config` to `config.toml`
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.30s
     Running `target\debug\examples\whisper.exe whisper-tiny-ct2 .\multi.wav`
Mel Spectrogram shape: [80, 3000]
Loaded model: Whisper { model: "whisper-tiny-ct2", multilingual: true, mels: 80, languages: 99, queued_batches: 0, active_batches: 0, replicas: 1 }
Detected language: DetectionResult { language: "<|en|>", probability: 0.9921079 }
Ok(" It's whoever, not whomever. No whomever is never actually right. No, sometimes it's right. Michael is right. It's a made-up word used to trick students. No. Actually, whomever is the formal version of the word. Obviously, it's a real word, but I don't know when to use it correctly. Not an A to speeder. I know what's right, but I'm not going to say because you're all jerks who didn't come see my band. You really know what's going to happen. I don't know. It's who when it's the object of the sentence and who when it's subject. That sounds right.")
Time taken: 7.0218621s
```
The only issue is the one I already opened in https://github.com/jkawamoto/ctranslate2-rs/issues/64. I'm thinking about integrating it into Vibe. Are there plans to add additional features to Whisper, such as new-segment callbacks, abort callbacks, translate settings, word timestamps, segment timestamps, temperature settings, initial prompt settings, progress callbacks, etc.?
The generate function takes options so that you can customize settings such as temperature. However, CTranslate2 doesn’t support callbacks in the Whisper model, and there is no way to cancel the generation.
Timestamps should be supported by the model. For the openai/whisper-tiny model, you need to remove `<|notimestamps|>` from the prompt. (See also https://huggingface.co/openai/whisper-tiny#usage)
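To make the prompt variants concrete, here is a tiny Python illustration (`whisper_prompt` is a hypothetical helper, not part of any library; the special tokens themselves come from the Whisper vocabulary, as used in the Rust example above):

```python
def whisper_prompt(language="<|en|>", task="<|transcribe|>", timestamps=False):
    """Build a Whisper prompt token sequence.

    Omitting "<|notimestamps|>" asks the model to emit timestamp tokens.
    Hypothetical helper for illustration only.
    """
    prompt = ["<|startoftranscript|>", language, task]
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt
```

So `whisper_prompt(timestamps=True)` yields the three-token prompt without `<|notimestamps|>`, which is what you would pass to `generate` to get timestamps.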
> CTranslate2 doesn’t support callbacks in the Whisper model, and there is no way to cancel the generation.
I'm thinking about the best way to deal with that, since users of the Vibe app sometimes need to transcribe a full hour of audio and may want to cancel it. Maybe I'll use a voice activity detector, but I'm not sure it's efficient to call the encoder/decoder on the short segments the VAD detects.
The docs say that even when transcribing a 1-hour audio file, Whisper needs to split it into chunks. This example only transcribes the first 30 seconds of audio. To transcribe longer files, you need to call `generate` on each 30-second window and aggregate the results. See the faster-whisper project for a complete transcription example using CTranslate2.
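A minimal sketch of that windowing loop in Python (the `transcribe_window` callable is a hypothetical stand-in for feature extraction plus `generate` on one 30-second window):

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK = 30 * SAMPLE_RATE  # Whisper operates on 30-second windows

def iter_windows(audio):
    """Yield consecutive 30-second windows; the last one is zero-padded."""
    for start in range(0, len(audio), CHUNK):
        window = audio[start:start + CHUNK]
        if len(window) < CHUNK:
            window = np.pad(window, (0, CHUNK - len(window)))
        yield window

def transcribe_long(audio, transcribe_window):
    """Apply a per-window transcriber over the whole file and join the text."""
    return " ".join(transcribe_window(w) for w in iter_windows(audio)).strip()
```

In the Rust bindings the loop would look the same: slice the samples, compute the features per window, and concatenate the decoded text. (A real implementation, like faster-whisper's, also carries the previous window's text over as a prompt and handles window boundaries more carefully.)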
Using a voice activity detector sounds like a good idea.
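For what it's worth, a toy energy-threshold VAD can be sketched in a few lines (this is only an illustration with made-up function names; a real app would use a proper detector such as WebRTC VAD or Silero, and the threshold here would need tuning per input):

```python
import numpy as np

def energy_vad(audio, sr=16000, frame_ms=30, threshold=0.02):
    """Mark each frame as speech-like (True) when its RMS energy
    exceeds a fixed threshold. Toy example, not a production VAD."""
    frame = sr * frame_ms // 1000
    flags = []
    for i in range(len(audio) // frame):
        seg = audio[i * frame:(i + 1) * frame]
        flags.append(bool(np.sqrt(np.mean(seg ** 2)) > threshold))
    return flags

def voiced_segments(flags, frame_ms=30):
    """Collapse per-frame flags into (start_ms, end_ms) speech segments."""
    segments, start = [], None
    for i, voiced in enumerate(flags):
        if voiced and start is None:
            start = i
        elif not voiced and start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None
    if start is not None:
        segments.append((start * frame_ms, len(flags) * frame_ms))
    return segments
```

Each detected segment could then be padded out to a 30-second window before being fed to the encoder, which addresses the concern above about running the model on very short snippets.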
> Perhaps I can add `StorageView` and `models.Whisper`, but I'm not sure if that's enough. This example uses `transformers` as the preprocessor, which is not available in Rust. Do you happen to know of an alternative preprocessor in Rust?
Looks like you solved it well, but maybe this is still a useful insight: https://github.com/wavey-ai/mel-spec
When I tried the library, it was broken. However, it looks like it has been fixed (https://github.com/wavey-ai/mel-spec/pull/10).
I’ll try it again. Thanks!
I appreciate your work on this. Can support for the OpenAI Whisper model be added? I'm new to Rust and cxx, but I can create a PR for you to review and correct, if you don't mind.
I want to use this library for a project, but I need the Whisper bindings.