sandrohanea / whisper.net

Whisper.net. Speech to text made simple using Whisper Models
MIT License

Using a microphone #80

Closed yakovw closed 7 months ago

yakovw commented 1 year ago

Is there a way in the library to use the microphone, rather than only transcribing an existing recording? The original library, whisper.cpp, supports this.

sandrohanea commented 1 year ago

@adamnova added something like this to the demo in: https://github.com/sandrohanea/whisper.net/pull/9

However, I didn't want to add the NAudio dependency to the full demo. Since then, each example has moved into its own project, so depending on NAudio there is no longer a problem.

I think it makes sense to move it into a standalone example.

For the best mic support, continuous recognition is also a must: https://github.com/sandrohanea/whisper.net/issues/25. Otherwise, the transcript can be bad near the boundaries where segments are "merged".

I would also add a mic example for Blazor, as that would be pretty cool.

adamnova commented 1 year ago

My demo was basically a proof of concept; it is not very usable in practice. Without continuous recognition, all you get is somewhat repetitive lines of text.

jbienz commented 9 months ago

It appears there is now continuous recognition here:

https://github.com/sandrohanea/whisper.net/tree/main/examples/ContinuousRecognition

It appears that's an example rather than part of core, though. Is there a chance of getting a microphone sample now?

danroot commented 8 months ago

I was able to get real-time transcription from the mic working on my M1 Mac using the code below, which uses OpenTK.OpenAL. It is stitched together from various Stack Overflow posts and could be improved, but it may help others trying to do something similar. I ended up having to download the CoreML model manually, unzip it, and put it in the current folder. Ideally, IMO, Whisper.net would "just work" and download this model when running on Apple silicon, similar to how it downloads the base .bin model.

The other "gotcha" I ran into was that I needed to use a float[] buffer and capture with ALFormat.MonoFloat32Ext.


    // Runs inside an async method. Likely needs: using System.Text;
    // using OpenTK.Audio.OpenAL; using Whisper.net; using Whisper.net.Ggml;
    var modelName = "ggml-base.bin";
    // TODO: also https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-base-encoder.mlmodelc.zip
    if (!File.Exists(modelName))
    {
        // Download the base GGML model once and cache it next to the executable.
        Console.WriteLine("Downloading whisper model...");
        using var modelStream = await WhisperGgmlDownloader.GetGgmlModelAsync(GgmlType.Base);
        using var fileWriter = File.OpenWrite(modelName);
        await modelStream.CopyToAsync(fileWriter);
    }
    using var whisperFactory = WhisperFactory.FromPath(modelName);

    using var processor = whisperFactory.CreateBuilder()
        .WithLanguage("en")
        .Build();

    // Whisper expects 16 kHz mono float samples; the capture buffer holds 10 seconds.
    int bufferLength = 10 * 16000;
    var mic = ALC.CaptureOpenDevice(null, 16000, ALFormat.MonoFloat32Ext, bufferLength);
    Console.WriteLine("Using:");
    Console.WriteLine(ALC.GetString(new ALDevice(mic.Handle), AlcGetString.DeviceSpecifier));
    var currentInput = new StringBuilder();
    ALC.CaptureStart(mic);
    var buffer = new float[bufferLength];

    // Poll the microphone once per second, for roughly 100 seconds.
    for (int i = 0; i < 100; ++i)
    {
        await Task.Delay(1000);
        int samplesAvailable = ALC.GetAvailableSamples(mic);

        if (samplesAvailable > 0)
        {
            // Read only what was captured and transcribe that chunk on its own.
            ALC.CaptureSamples(mic, buffer, samplesAvailable);
            await foreach (var resultData in processor.ProcessAsync(buffer[..samplesAvailable]))
            {
                Console.WriteLine("RAW:" + resultData.Text);
            }
        }
    }
    ALC.CaptureStop(mic);
    ALC.CaptureCloseDevice(mic);

sandrohanea commented 7 months ago

I will close all issues related to streaming processing and link them to: https://github.com/sandrohanea/whisper.net/issues/25