jkawamoto / ctranslate2-rs

Rust bindings for OpenNMT/CTranslate2
https://docs.rs/ct2rs
MIT License

Add support for OpenAI Whisper Model #19

Closed bezaleel22 closed 3 months ago

bezaleel22 commented 7 months ago

I appreciate your work on this. Could support for the OpenAI Whisper model be added? I'm new to Rust and cxx, but I can create a PR for you to review and correct, if you don't mind.

I want to use this library for a project, but I need the Whisper bindings.

jkawamoto commented 6 months ago

Sorry for my late reply. Since I’ve been busy recently, I might be unable to add support for Whisper Models, but any PRs are welcome.

thewh1teagle commented 3 months ago

I'm also looking to use Whisper with CTranslate2 in Rust; it should run about 2-3x faster than whisper.cpp. Are the bindings ready, or do they still need to be written? That's why I asked about integrating bindgen, so we could easily access all of the CTranslate2 functions.

jkawamoto commented 3 months ago

Perhaps I can add StorageView and models.Whisper, but I’m not sure if that’s enough. This example uses transformers as the preprocessor, which is not available in Rust. Do you happen to know of an alternative preprocessor in Rust?

thewh1teagle commented 3 months ago

Do you happen to know of an alternative preprocessor in Rust?

I'm not aware of one, but perhaps tch-rs might be capable of doing it.

jkawamoto commented 3 months ago

From the docs, we need to make a Mel spectrogram of the input audio. I’m still looking for an appropriate way to do so, but the rest of the steps are implemented in the whisper branch (#68) and you can test it.

Here is some sample code.

Since we don't have the preprocessor in Rust yet, we first need to run this Python code:

import librosa
import numpy as np
import transformers

# Load and resample the audio file.
audio, _ = librosa.load("audio.wav", sr=16000, mono=True)

# Compute the features of the first 30 seconds of audio.
processor = transformers.WhisperProcessor.from_pretrained("openai/whisper-tiny")
inputs = processor(audio, return_tensors="np", sampling_rate=16000)

# Save the features.
np.save("features.npy", inputs.input_features)

The above code reads audio.wav and saves its preprocessed features into features.npy.

Then, this code generates the text from the features:

use anyhow::Result;
use ndarray::Array3;

use ct2rs::{auto, Tokenizer};
use ct2rs::storage_view::StorageView;
use ct2rs::whisper::Whisper;

fn main() -> Result<()> {
    // Read the features from the file.
    let mut array: Array3<f32> = ndarray_npy::read_npy("features.npy")?;

    let shape = array.shape().to_vec();
    let v = StorageView::new(&shape, array.as_slice_mut().unwrap(), Default::default())?;

    // Load the converted model and its tokenizer.
    let model = Whisper::new("whisper-tiny-ct2", Default::default())?;
    let tokenizer = auto::Tokenizer::new("whisper-tiny-ct2")?;

    let lang = model.detect_language(&v)?;
    println!("Detected language: {:?}", lang[0][0]);

    let res = model.generate(
        &v,
        &vec![vec![
            "<|startoftranscript|>",
            &lang[0][0].language,
            "<|transcribe|>",
            "<|notimestamps|>",
        ]],
        &Default::default(),
    )?;

    for v in res[0].sequences.iter() {
        println!("{:?}", tokenizer.decode(v.clone()));
    }

    Ok(())
}

The model file is converted with this command:

ct2-transformers-converter --model openai/whisper-tiny --output_dir whisper-tiny-ct2 --copy_files preprocessor_config.json

thewh1teagle commented 3 months ago

I tried it and got this error:

(venv) ➜  sample git:(main) ls whisper-tiny-ct2 
config.json              model.bin                preprocessor_config.json vocabulary.json
(venv) ➜  sample git:(main) cargo run 
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/sample`
Error: failed to create a tokenizer
(venv) ➜  sample git:(main) 

https://github.com/thewh1teagle/ct2rs-test

jkawamoto commented 3 months ago

I forgot to mention that you also need to download tokenizer.json from https://huggingface.co/openai/whisper-tiny/tree/main and put it in the same directory as model.bin.

jkawamoto commented 3 months ago

Added an example that doesn't require Python code.

https://github.com/jkawamoto/ctranslate2-rs/blob/whisper/examples/whisper.rs

thewh1teagle commented 3 months ago

Added an example that doesn't require Python code.

It's working on Windows!

(venv) PS C:\Users\User\Documents\code\ctranslate2-rs> cargo run --example whisper whisper-tiny-ct2 .\multi.wav
warning: `C:\Users\User\.cargo\config` is deprecated in favor of `config.toml`
note: if you need to support cargo 1.38 or earlier, you can symlink `config` to `config.toml`
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.30s
     Running `target\debug\examples\whisper.exe whisper-tiny-ct2 .\multi.wav`
Mel Spectrogram shape: [80, 3000]
Loaded model: Whisper { model: "whisper-tiny-ct2", multilingual: true, mels: 80, languages: 99, queued_batches: 0, active_batches: 0, replicas: 1 }
Detected language: DetectionResult { language: "<|en|>", probability: 0.9921079 }
Ok(" It's whoever, not whomever. No whomever is never actually right. No, sometimes it's right. Michael is right. It's a made-up word used to trick students. No. Actually, whomever is the formal version of the word. Obviously, it's a real word, but I don't know when to use it correctly. Not an A to speeder. I know what's right, but I'm not going to say because you're all jerks who didn't come see my band. You really know what's going to happen. I don't know. It's who when it's the object of the sentence and who when it's subject. That sounds right.")
Time taken: 7.0218621s

The only issue is the one I already opened in https://github.com/jkawamoto/ctranslate2-rs/issues/64. I'm thinking about integrating this into Vibe. Are there plans to add additional features to Whisper, such as new-segment callbacks, abort callbacks, a translate setting, word timestamps, segment timestamps, temperature settings, an initial prompt setting, progress callbacks, etc.?

jkawamoto commented 3 months ago

#64 is tricky. Maybe I should ask the CTranslate2 developers for help.

The generate function takes options so that you can customize settings such as temperature. However, CTranslate2 doesn’t support callbacks in the Whisper model, and there is no way to cancel the generation.
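
As a rough illustration, a minimal sketch of passing non-default options. The struct name WhisperOptions and the beam_size/sampling_temperature field names are assumptions based on CTranslate2's own Whisper options, so check the crate docs for the actual names:

// Assumed option names (not verified against the crate): this struct replaces the
// `&Default::default()` passed as the third argument to `generate` in the sample above.
let options = ct2rs::whisper::WhisperOptions {
    beam_size: 5,
    sampling_temperature: 0.2,
    ..Default::default()
};
// `prompt` is the same token prompt as in the sample above.
let res = model.generate(&v, &prompt, &options)?;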

Timestamps should be supported by the model. For the openai/whisper-tiny model, you need to remove <|notimestamps|> from the prompt. (See also https://huggingface.co/openai/whisper-tiny#usage)
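
For instance, the prompt from the sample above would then look like this (the only change is dropping <|notimestamps|>; everything else stays the same):

// Same call as in the sample above, but without "<|notimestamps|>", so the model
// emits timestamp tokens in the generated sequence.
let res = model.generate(
    &v,
    &vec![vec![
        "<|startoftranscript|>",
        &lang[0][0].language,
        "<|transcribe|>",
    ]],
    &Default::default(),
)?;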

thewh1teagle commented 3 months ago

CTranslate2 doesn’t support callbacks in the Whisper model, and there is no way to cancel the generation.

I'm thinking about the best way to deal with that, since users sometimes need to transcribe a full hour of audio in the Vibe app and may want to cancel it. Maybe I'll use a voice activity detector, but I'm not sure it's efficient to call the encoder/decoder on the short sentences that the VAD detects.

jkawamoto commented 3 months ago

The docs say that Whisper needs to split the audio into chunks, even when transcribing a 1-hour file:

This example only transcribes the first 30 seconds of audio. To transcribe longer files, you need to call generate on each 30-second window and aggregate the results. See the project faster-whisper for a complete transcription example using CTranslate2.
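
As a rough sketch of that window-by-window approach, reusing the calls from the sample above. The compute_features helper is hypothetical and stands in for whatever produces the mel-spectrogram StorageView for one 30-second window (e.g. the preprocessing in examples/whisper.rs), and this ignores the context carry-over that faster-whisper does:

use anyhow::Result;

use ct2rs::{auto, Tokenizer};
use ct2rs::storage_view::StorageView;
use ct2rs::whisper::Whisper;

// Hypothetical helper (not part of ct2rs): turn one window of 16 kHz mono samples
// into the mel-spectrogram StorageView that `generate` expects. Windows shorter
// than 30 seconds need to be padded to the full length.
fn compute_features(samples: &[f32]) -> Result<StorageView> {
    todo!("mel-spectrogram preprocessing, e.g. as in examples/whisper.rs")
}

fn transcribe(samples: &[f32]) -> Result<Vec<String>> {
    const WINDOW: usize = 30 * 16_000; // 30 seconds at 16 kHz

    let model = Whisper::new("whisper-tiny-ct2", Default::default())?;
    let tokenizer = auto::Tokenizer::new("whisper-tiny-ct2")?;

    let mut texts = Vec::new();
    for chunk in samples.chunks(WINDOW) {
        // Preprocess this window and detect its language, as in the sample above.
        let features = compute_features(chunk)?;
        let lang = model.detect_language(&features)?;

        // Generate the text for this window and collect it.
        let res = model.generate(
            &features,
            &vec![vec![
                "<|startoftranscript|>",
                &lang[0][0].language,
                "<|transcribe|>",
                "<|notimestamps|>",
            ]],
            &Default::default(),
        )?;
        for seq in res[0].sequences.iter() {
            texts.push(tokenizer.decode(seq.clone())?);
        }
    }
    Ok(texts)
}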

Using a voice activity detector sounds like a good idea.

thewh1teagle commented 2 months ago

Perhaps I can add StorageView and models.Whisper, but I’m not sure if that’s enough. This example uses transformers as the preprocessor, which is not available in Rust. Do you happen to know of an alternative preprocessor in Rust?

Looks like you solved it well, but maybe it's still a useful insight: https://github.com/wavey-ai/mel-spec

jkawamoto commented 2 months ago

When I tried the library, it was broken. However, it looks like it has been fixed (https://github.com/wavey-ai/mel-spec/pull/10).

I’ll try it again. Thanks!