mallorbc / whisper_mic

Project that allows one to use a microphone with OpenAI whisper.
MIT License

Whisper_mic for faster-whisper/CTranslate2? #21

Closed: ghost closed this issue 8 months ago

ghost commented 1 year ago

I've found this script to be amazingly effective; it gives good real-time performance and accuracy, although I do wish the latency was a little lower (or that it could simply spit out smaller chunks of text more continuously).

It seems like faster-whisper is able to process the same audio much faster, and it might give even better real-time low-latency results than using stock whisper. I've been toying around with the code, but I'm very novice and clearly CTranslate2 uses different systems and parameters, so it keeps throwing errors when I try to point at it instead of whisper. Would you consider including support for faster-whisper to benefit from the massive performance improvements and lower memory usage?

mallorbc commented 1 year ago

I actually heard about faster-whisper recently. I plan on looking into it more closely, along with other projects, and may add support here.

ghost commented 1 year ago

@mallorbc

> I actually heard about faster-whisper recently. I plan on looking into it more closely, along with other projects, and may add support here.

Well, I'm no coder, but I did figure out, rudimentarily, how to get it working with faster-whisper. It works on much the same principle, except that faster-whisper's output isn't a dictionary like normal whisper's: it returns a tuple containing a segment generator and the audio info. So you just need to import it with "from faster_whisper import WhisperModel" and modify result to take index 0 of the tuple for the generator:

    predicted_text = result[0]
    result_queue.put_nowait(predicted_text)

And then when it comes time to print it out, iterate over it to get the text field specifically:

    for segment in result_queue.get():
        finished_text = segment.text
        print(finished_text)
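
Put together, a minimal end-to-end sketch of this approach would look something like the following (the model size, device, and audio path here are just placeholders, not anything from the repo):

    # Sketch: load a faster-whisper model, transcribe a file, and print
    # each segment's text. Model size, device, and path are placeholders.
    from faster_whisper import WhisperModel

    audio_model = WhisperModel("base", device="cpu", compute_type="int8")

    # transcribe() returns a (segments, info) tuple, not a dict like stock whisper
    segments, info = audio_model.transcribe("audio.wav")

    # segments is a generator; the model runs as you iterate over it
    for segment in segments:
        print(segment.text)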

And that's really all it takes to get it going. It seems to work well, but I imagine you might have a more sophisticated solution. If not, maybe I could try to bang it together and do my first ever pull request :)

elia-ashraf commented 1 year ago

> Well, I'm no coder, but I did figure out, rudimentarily, how to get it working with faster-whisper. [...] If not, maybe I could try to bang it together and do my first ever pull request :)

Hey. Did you make a pull request and do this? I was really hoping this worked with faster-whisper, which is definitely much better than the standard Whisper (or maybe with WhisperJAX).

DevenBL commented 1 year ago

I have tried crudely hammering this into the script, but I cannot for the life of me get the following function working:

def transcribe_forever(audio_queue, result_queue, audio_model, english, verbose, save_file):
    while True:
        audio_data = audio_queue.get()
        if english:
            # original whisper call: result = audio_model.transcribe(audio_data, language='english')
            result, _ = audio_model.transcribe(audio_data, language='english')
        else:
            # original whisper call: result = audio_model.transcribe(audio_data)
            result, _ = audio_model.transcribe(audio_data)

        if not verbose:
            predicted_text = result[0]
            result_queue.put_nowait("You said: " + predicted_text)
        else:
            result_queue.put_nowait(result)

        if save_file:
            os.remove(audio_data)

It always dies at result, _ = audio_model.transcribe(audio_data). I have no clue what this magic syntax from the faster-whisper documentation is supposed to be: , _
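
For what it's worth, the , _ isn't faster-whisper magic at all; it's ordinary Python tuple unpacking. transcribe() returns a (segments, info) pair, and _ is just the conventional name for a value you don't intend to use. Roughly (reusing the audio_model and audio_data names from the snippet above):

    # `result, _ = ...` is plain tuple unpacking: transcribe() returns
    # (segments, info), and `_` just discards the info value
    segments, info = audio_model.transcribe(audio_data)   # keep both values
    result, _ = audio_model.transcribe(audio_data)        # throw info away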

DeluxeMonster commented 1 year ago

I'm trying to make a whisper + LLM + bark bot, so this awesome repo is just what I was looking for. Thanks mallorbc!

The problem with faster-whisper is that the model only actually runs while you iterate over the segments, so you can't leave out the iteration:

    segments, info = audio_model.transcribe(audio_data, without_timestamps=True)
    result = ""
    for segment in segments:
        result += segment.text

without_timestamps=True is much faster than normal, and info carries some data you can work with, e.g.:

    print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
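
One way to see that laziness concretely (a sketch, reusing the same audio_model and audio_data names as above): transcribe() itself returns almost immediately, and the real compute happens during iteration, so forcing the generator with list() makes the call block until transcription is done:

    # transcribe() returns right away; the model hasn't actually run yet
    segments, info = audio_model.transcribe(audio_data, without_timestamps=True)

    # forcing the generator is what actually runs the transcription
    segments = list(segments)
    result = "".join(segment.text for segment in segments)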

evranch commented 8 months ago

The main issue was that faster-whisper doesn't want to be passed a Tensor. Got it working, and the performance is way better.
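
For context on that point: faster-whisper's transcribe() accepts a file path, a file-like object, or a float32 NumPy array, but not a torch Tensor, so the tensor has to be converted first. A rough sketch, where audio_tensor is a hypothetical mono 16 kHz float Tensor:

    import numpy as np

    # faster-whisper takes a path, file-like object, or float32 NumPy array;
    # convert the (hypothetical) torch Tensor before passing it in
    audio_np = audio_tensor.cpu().numpy().astype(np.float32)
    segments, _ = audio_model.transcribe(audio_np)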

mallorbc commented 8 months ago

I'm going to close this issue since it has now been merged into main.

Thanks for the PRs everybody!