mallorbc / whisper_mic

Project that allows one to use a microphone with OpenAI whisper.
MIT License

Issues fixed. Improved the listen() method. Added a new record() method. #47

Closed sankhadeepdutta closed 9 months ago

sankhadeepdutta commented 11 months ago

Fixes #46

Changes made:

  1. All methods related to the implementation details of the WhisperMic class have been made private. Only the methods the user needs are public.

  2. Used the listen() method of the Recognizer class from the speech_recognition package instead of the listen_in_background() method. This ensures the listen() method of the WhisperMic class works as intended: it waits for the user to finish speaking and then takes further action. If the user provides values for the optional parameters timeout and phrase_time_limit, the listen() method behaves accordingly. timeout determines how long the listen() method waits for audio input to begin; phrase_time_limit specifies how long the recognizer should wait for a spoken phrase to be completed. The default is None for both.

  3. [Important] A new method, __record_handler(), has been added, which wraps the record() method of the Recognizer class in the speech_recognition package. The key differences between the record() and listen() methods of the Recognizer class are: record() captures audio for a fixed duration specified by the user, while listen() captures audio in real time and automatically detects the end of speech using silence detection. Looking at the functionality, I believe the current implementation of the listen() method of the WhisperMic class actually wants to incorporate the record() method. So I have added a new record() method, which calls the __record_handler() method and transcribes the audio input. It has the optional parameters duration and offset: duration specifies how long the recording should continue, and offset specifies how long to wait before actually starting the recording. The default is None for both. Reference: https://github.com/Uberi/speech_recognition/blob/master/reference/library-reference.rst

  4. The record_callback() method has been renamed to __record_load(), whose task is to load the raw recorded audio into the audio_queue queue.

  5. A new method, __listen_handler(), has been added, whose task is to handle the audio recording process, load the audio data into the audio_queue queue, and finally transcribe the audio data. The output is stored in the result_queue queue. This method is used to implement both the listen() and the listen_loop() methods.

  6. Added exception handling to the __listen_handler() method: when the user sets a value for the timeout parameter in listen(), an exception is raised if the user does not start speaking within the given time.

  7. The task of recording audio from the mic has been moved from the setup_mic() method to the __listen_handler() method. The setup_mic() method should only set up the mic properties. The program now starts recording only after the user calls either the listen() or listen_loop() method.

  8. The timeout and phrase_time_limit parameters have been added to the listen_loop() method as well.
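The public/private split described above can be sketched as a minimal mock. This is a hypothetical, simplified stand-in: the real WhisperMic wraps speech_recognition and Whisper, which are replaced here by placeholder strings so the sketch stays self-contained and runnable.

```python
import queue


class WhisperMicSketch:
    """Simplified mock of the reworked WhisperMic interface.

    Double-underscore names are name-mangled by Python, so callers can
    only reach the public listen() / record() methods.
    """

    def __init__(self):
        self.audio_queue = queue.Queue()   # raw audio chunks
        self.result_queue = queue.Queue()  # transcribed text

    def __record_load(self, raw_audio):
        # Load the raw recorded audio into the audio queue.
        self.audio_queue.put(raw_audio)

    def __listen_handler(self, timeout, phrase_time_limit):
        # Real class: recognizer.listen(source, timeout, phrase_time_limit),
        # then transcription with Whisper. Faked here with a fixed chunk.
        self.__record_load(b"fake-audio")
        self.result_queue.put("transcribed: %s" % self.audio_queue.get())

    def __record_handler(self, duration, offset):
        # Real class: recognizer.record(source, duration, offset).
        self.__record_load(b"fake-audio")
        self.result_queue.put("transcribed: %s" % self.audio_queue.get())

    def listen(self, timeout=None, phrase_time_limit=None):
        self.__listen_handler(timeout, phrase_time_limit)
        return self.result_queue.get()

    def record(self, duration=None, offset=None):
        self.__record_handler(duration, offset)
        return self.result_queue.get()
```

Because of name mangling, `mic.__listen_handler` is not reachable from outside the class, which is what "made private" means in Python terms here.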

mallorbc commented 10 months ago

Thanks! But there is an issue

Thanks for the PR! Sorry it took so long to review. I really like this; however, in my testing there are issues with the PR with regard to the listen_loop functionality.

If no phrase_time_limit is given, it will not transcribe in "real time" anymore. That is OK and to be expected. Giving a value to that argument mostly fixes the issue.

However, during the listen_loop method, any words said during transcription will be lost. One can easily see this by giving a phrase_time_limit value of, say, 2 and observing that at the end of the two seconds transcription starts and a word or two is missed in that window. The current implementation on the main branch does not miss any words.

What I think the problem is

The reason for this I think can be attributed to the fact that the microphone is no longer listening in the background so during the code here: https://github.com/sankhadeepdutta/whisper_mic/blob/a0be2a17d11da075c561a9261de8df7ea839a27c/whisper_mic/whisper_mic.py#L88-L91

The microphone is no longer taking in audio data and has to wait for the transcription to finish before it starts listening again.
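The lost-audio problem can be illustrated with a stdlib-only simulation (hypothetical function names; the fixed durations stand in for real microphone capture and Whisper transcription times): a sequential loop only hears the mic while it is not transcribing, so speech that happens mid-transcription is simply never captured.

```python
def sequential_loop(total_ms, capture_ms, transcribe_ms):
    """Simulate a capture -> transcribe loop where the mic is idle
    during transcription (the behavior described above)."""
    t, heard = 0, 0
    while t + capture_ms <= total_ms:
        t += capture_ms        # mic is open: one chunk captured
        heard += capture_ms
        t += transcribe_ms     # mic is closed while Whisper runs
    return heard


def background_loop(total_ms, capture_ms, transcribe_ms):
    """Simulate listen_in_background(): capture runs continuously in
    its own thread, so every millisecond of speech is heard."""
    return total_ms


heard = sequential_loop(1000, 100, 300)
# heard == 300: only 300 of 1000 ms of speech were captured,
# while background_loop(1000, 100, 300) hears all 1000 ms.
```

The numbers are arbitrary, but the ratio shows why words "disappear" whenever transcription takes longer than the gap between phrases.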

Other than this issue, I think your implementation is great. For the simple listen method, it is superior. And of course, the comments and code cleanup are much appreciated.

Ideas and next steps

I am going to make this a branch and work on it when I get a chance. At the same time, feel free to make additional changes to address these issues if you are so inclined.

I think using your method for the listen method is better as I already said, but either threading or listening in the background is needed for the real-time transcription.

Perhaps we use a variation of your solution for the listen method and a variation of the existing solution for the listen_loop method.
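The "threading or listening in the background" idea above boils down to a producer/consumer split: a capture thread keeps feeding audio_queue no matter how slow transcription is, and a second thread drains it. A stdlib-only sketch (hypothetical names; real code would call recognizer.listen_in_background() or run recognizer.listen() inside the capture thread):

```python
import queue
import threading

audio_queue = queue.Queue()
result_queue = queue.Queue()


def capture(n_chunks):
    # Producer: keeps recording regardless of transcription speed.
    for i in range(n_chunks):
        audio_queue.put(f"chunk-{i}")
    audio_queue.put(None)  # sentinel: capture finished


def transcribe():
    # Consumer: drains the queue; chunks queue up instead of being lost
    # while a previous chunk is still being transcribed.
    while (chunk := audio_queue.get()) is not None:
        result_queue.put(f"text({chunk})")


producer = threading.Thread(target=capture, args=(5,))
consumer = threading.Thread(target=transcribe)
producer.start(); consumer.start()
producer.join(); consumer.join()

results = [result_queue.get() for _ in range(result_queue.qsize())]
```

Because the queue buffers chunks, no audio is dropped even when the consumer falls behind, which is exactly the property the main-branch listen_in_background() implementation has.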

sankhadeepdutta commented 10 months ago

Thanks for checking out my PR. I understand the issue you mentioned and will try to fix it.

SunnyOd commented 10 months ago

Thanks guys, I think this is just what I've been hoping for (not much of a coder!). I think what this means is the ability to record audio and process it as a block vs. real-time, which processes the audio stream live. Is that right?

Assuming what I said is true, might this new functionality make the transcription more accurate, given that Whisper interprets a whole sentence instead of going word by word? This seems to be the way the OpenAI Whisper API works; it seems more accurate, as the transcription is delayed and the speech appears to be processed as a block.

mallorbc commented 10 months ago

@sankhadeepdutta Any progress on a fix? If you don't have time for this, let me know and I will develop my own solution based on your work.

Thanks again for your PR!

sankhadeepdutta commented 9 months ago

@mallorbc sorry for the delay. I am currently occupied with a higher-priority task, so I couldn't work on the fix yet. The task should be complete in 3-4 days, and then I can start working on the fix. Let me know if you have already started working on it.

mallorbc commented 9 months ago

> @mallorbc sorry for the delay. Yeah actually I am occupied with a task on priority, so couldn't work on the fix. Hopefully the task will be complete in 3-4 days, then I can start working on the fix. Let me know if you have already started working on the fix.

I have not yet started on the fix.

sankhadeepdutta commented 9 months ago

@mallorbc I have started working on the fix, will push the changes once done.

sankhadeepdutta commented 9 months ago

@mallorbc I have fixed the listen_loop() method as suggested by you. Please check it out.

mallorbc commented 9 months ago

Finally got around to reviewing this. It no longer cuts off text when using listen_loop, but it does not process in real time.

Run the current code on main with whisper_mic --loop --dictate and notice that it processes the spoken word as it is said.

This new implementation waits until you stop speaking.

I will try to see what the issue is and fix it.

Thanks again for your work!

mallorbc commented 9 months ago

I think it has to do with this line: https://github.com/sankhadeepdutta/whisper_mic/blob/5a25df20a6a84f03374ed4f68cf518304babe3d3/whisper_mic/whisper_mic.py#L147

Where phrase_time_limit defaults to None and the CLI passes nothing, so it is None. Gonna try passing a value from cli.py and see if that fixes it. If it does, the code is likely good to merge.
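One way to check that theory is to expose the parameter on the CLI with a finite default instead of letting it fall through as None. A hypothetical argparse sketch (the actual cli.py may use a different CLI library, and the flag names and the 2-second default are assumptions for illustration):

```python
import argparse

parser = argparse.ArgumentParser(description="whisper_mic CLI sketch")
parser.add_argument("--loop", action="store_true",
                    help="transcribe continuously")
parser.add_argument("--phrase_time_limit", type=float, default=2.0,
                    help="max seconds per phrase; a finite default keeps "
                         "--loop close to real-time instead of waiting "
                         "for silence")

# Simulate invoking the CLI with only --loop: the default kicks in,
# so listen_loop() would receive 2.0 rather than None.
args = parser.parse_args(["--loop"])
```

With a finite phrase_time_limit, listen_loop() chunks the audio every couple of seconds rather than waiting for the speaker to stop, which matches the behavior described in the comment above.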

mallorbc commented 9 months ago

That was it. I will merge and do the quick fix after!

Thanks so much!