savbell / whisper-writer

💬📝 A small dictation app using OpenAI's Whisper speech recognition model.
GNU General Public License v3.0

Distil Whisper models: Missing words and repetitions in transcription #59

Open go-run-jump opened 4 weeks ago

go-run-jump commented 4 weeks ago

I've been experimenting with the distil whisper models as an alternative to the standard whisper models. While I was able to successfully integrate the distil models, I'm experiencing some issues with the transcription quality.

Current Behavior:

Expected Behavior:

Additional Information:

Questions:

  1. Has anyone else tested the distil whisper models and experienced these issues?
  2. Are there any known factors that might be responsible for this inconsistent transcription behavior?
  3. Are there any recommended settings or configurations that might help resolve these issues while maintaining the speed advantage of the distil models?

Any input or suggestions would be greatly appreciated, as the speed improvements of the distil models are significant.

Environment:

Steps to Reproduce:

  1. Add the distil whisper model options to config_schema.yaml as mentioned above
  2. Select one of the distil whisper models in the settings
  3. Attempt to use voice input for an extended period (more than two sentences)
  4. Observe the resulting transcription for missing words and repetitions

dariox1337 commented 3 weeks ago

I've tested the distil-small model. It works without issues for me. Granted, I mostly use the "hold to record" mode, and therefore dictate one sentence at a time. I tried dictating a couple of sentences, and still didn't notice any issues. However, it feels weird not seeing what you say for a long time. Anyway, can you suggest a phrase that often fails to transcribe properly for you?

Recording...
Recording finished. Size: 260640 samples, Duration: 16.29 seconds
Transcribing...
Transcription completed in 0.51 seconds. Post-processed line:  I have been experimenting with the distilled whisper models as an alternative to the standard whisper models. While I was able to successfully integrate the distilled models, I am experiencing some issues with transcription quality. 

NOTE: I'm using a heavily edited fork, so consider my observations as related to the underlying libraries rather than WhisperWriter. You can try my fork if you feel like it: https://github.com/dariox1337/whisper-writer To use distil models with this fork, simply download a faster-distil-whisper model from HF and set its folder in "model path".

go-run-jump commented 1 week ago

@dariox1337 I have identified what is responsible for the decreased quality of the distil whisper models. They seem to be more susceptible to problems in the original audio. On my machine, which runs Linux, the audio recorded by the sounddevice library plays faster than real time, skips, and has crackling noises on top. After I replaced sounddevice with pyaudio, the quality of distil whisper is just what you would expect. No issues. This may be happening unnoticed for more people and reducing the quality of their transcriptions, which is likely because the original whisper models seem to handle it very well and you can't select the distil models without changing the code. If so, it might be beneficial to replace sounddevice or find out what is responsible for its misbehavior.
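
One way to check a capture for this kind of problem is to dump the recorded buffer to a WAV file, listen to it, and compare its duration with how long you actually spoke. The sketch below is not WhisperWriter's code. It uses only Python's standard library, the helper names `save_wav` and `wav_duration_seconds` are made up for illustration, and a generated tone stands in for real microphone samples.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, sample_rate_hz):
    """Write 16-bit mono PCM samples (ints in [-32768, 32767]) to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)               # mono
        wav.setsampwidth(2)               # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate_hz)  # rate the file claims to be recorded at
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))


def wav_duration_seconds(path):
    """Duration implied by the frame count and the sample rate in the header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Stand-in for a captured buffer: a 2-second 440 Hz tone at 16 kHz.
# In a real check this list would hold the samples from the recorder.
rate_hz = 16000
tone = [int(10000 * math.sin(2 * math.pi * 440 * n / rate_hz))
        for n in range(2 * rate_hz)]

path = os.path.join(tempfile.mkdtemp(), "capture_check.wav")
save_wav(path, tone, rate_hz)
duration = wav_duration_seconds(path)
print(f"{duration:.2f} s")  # 2.00 s
```

If the reported duration is clearly shorter than the time spent speaking, samples were dropped or the rate written to the file does not match the rate the device actually recorded at.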

go-run-jump commented 1 week ago

@dariox1337 Actually, it seems that this behavior only happens in your fork, the one you proposed merging in #61. Why the distil models don't work for me in the current state of the software remains unclear.

dariox1337 commented 4 days ago

@go-run-jump as I said in the PR, the faster-than-real-time audio might be because the sample rate isn't set correctly somewhere (I don't know where). The skipping and crackling are a mystery to me. I couldn't reproduce either of the issues.

Anyway, even though SoundDevice worked without issues for me, I rewrote the audio recording code to use PyAudio as well since it was already used for "beep on completion." The code is in the main branch of my fork.
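
The sample-rate hypothesis from this comment can be illustrated with a little arithmetic. This is only a sketch: the function names are hypothetical, and the 44100 Hz figure is an assumed example value, not something reported in this thread. The sample count and duration come from the log output quoted earlier, which works out to the 16 kHz rate that Whisper models expect.

```python
def implied_sample_rate(num_samples, duration_seconds):
    """Sample rate implied by a buffer's length and its reported duration."""
    return num_samples / duration_seconds


def playback_speed(capture_rate_hz, assumed_rate_hz):
    """Speed factor when audio captured at one rate is interpreted at another.

    A value above 1.0 means the audio sounds faster than real time.
    """
    return assumed_rate_hz / capture_rate_hz


# Figures from the log line "Size: 260640 samples, Duration: 16.29 seconds".
log_rate = implied_sample_rate(260640, 16.29)
print(round(log_rate))  # 16000

# Assumed example: samples really captured at 16 kHz but treated as 44.1 kHz.
speed = playback_speed(16000, 44100)
print(round(speed, 2))  # 2.76
```

A factor noticeably above 1.0 would match audio that sounds faster than real time, but it would not by itself explain skipping or crackling.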