Distil Whisper models: Missing words and repetitions in transcription

go-run-jump commented 3 months ago

I've been experimenting with the distil whisper models as an alternative to the standard whisper models. While I was able to successfully integrate the distil models, I'm experiencing some issues with the transcription quality.

Current Behavior:

- Approximately two sentences into the transcription, issues start to occur
- Parts of sentences are missing from the transcription (not necessarily full sentences)
- Some words or short phrases are sometimes transcribed multiple times
- The pattern of missing or repeated content is inconsistent from one recording to the next
- I have tried with VAD on and off. It doesn't change anything.
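
To make the repetition symptom easier to compare between models, a transcript can be scanned for phrases that are immediately repeated. The following is only a minimal sketch: `find_repeated_phrases` and its `max_len` parameter are hypothetical names introduced here, not part of WhisperWriter, and the check only catches back-to-back repeats, not missing words.

```python
def find_repeated_phrases(text, max_len=4):
    """Return (phrase, word_index) pairs for phrases repeated back to back.

    Hypothetical helper: looks for runs of 1..max_len words that are
    immediately followed by the same run of words.
    """
    # Normalize: lowercase and strip common punctuation from each word.
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    words = [w for w in words if w]

    hits = []
    i = 0
    while i < len(words):
        matched = False
        # Prefer the longest repeated phrase starting at position i.
        for n in range(max_len, 0, -1):
            if i + 2 * n <= len(words) and words[i:i + n] == words[i + n:i + 2 * n]:
                hits.append((" ".join(words[i:i + n]), i))
                i += n  # skip the first copy, continue from the second
                matched = True
                break
        if not matched:
            i += 1
    return hits


if __name__ == "__main__":
    sample = "While I was able to to successfully integrate the distil models"
    print(find_repeated_phrases(sample))
```

Running the same recording through a standard and a distil model and comparing the number of hits gives a rough, repeatable measure of the problem.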

Expected Behavior:

- Continuous, accurate transcription without missing words or repetitions
- Performance similar to standard whisper models in terms of transcription quality

Additional Information:

- The distil models are running about twice as fast as the standard models
- When transcription occurs correctly, the quality seems to be on par with standard models
- To add support for distil whisper models, the following lines need to be added to config_schema.yaml:
```
- distil-small.en
- distil-medium.en
- distil-large-v2
- distil-large-v3
```

Questions:

- Has anyone else tested the distil whisper models and experienced these issues?
- Are there any known factors that might be responsible for this inconsistent transcription behavior?
- Are there any recommended settings or configurations that might help resolve these issues while maintaining the speed advantage of the distil models?
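
One setting that may be worth testing, assuming the models are loaded through faster-whisper (the thread mentions faster-distil-whisper models), is `condition_on_previous_text=False`, which the faster-whisper documentation uses in its distil-whisper example. Whether it fixes the behavior described in this issue is not confirmed anywhere in the thread. The function names below are hypothetical and the snippet is only a sketch.

```python
def distil_transcribe_options():
    """Options to try with distil models (an assumption, not a confirmed fix)."""
    return {
        "language": "en",
        # Do not feed the previous segment's text back in as a prompt.
        "condition_on_previous_text": False,
    }


def transcribe_file(audio_path, model_name="distil-small.en"):
    """Transcribe one audio file with a distil model via faster-whisper."""
    # Imported lazily so the options helper can be used without the library.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_name, device="auto", compute_type="default")
    segments, _info = model.transcribe(audio_path, **distil_transcribe_options())
    # `segments` is a generator; joining the text consumes it.
    return "".join(segment.text for segment in segments).strip()
```

Calling `transcribe_file("recording.wav")` on a saved recording (the file name is a placeholder) makes it possible to test the model outside the application and rule out the recording path.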

Any input or suggestions would be greatly appreciated, as the speed improvements of the distil models are significant.

Environment:

- Operating System: Manjaro Linux with Gnome
- Python version: 3.11
- Branch: main (commit 71c03f663dc7475a0477701c852b51993570cb54)

Steps to Reproduce:

1. Add the distil whisper model options to config_schema.yaml as mentioned above
2. Select one of the distil whisper models in the settings
3. Attempt to use voice input for an extended period (more than two sentences)
4. Observe the resulting transcription for missing words and repetitions

dariox1337 commented 2 months ago

I've tested the distil-small model. It works without issues for me. Granted, I mostly use the "hold to record" mode, and therefore dictate one sentence at a time. I tried dictating a couple of sentences, and still didn't notice any issues. However, it feels weird not seeing what you say for a long time. Anyway, can you suggest a phrase that often fails to transcribe properly for you?

Recording...
Recording finished. Size: 260640 samples, Duration: 16.29 seconds
Transcribing...
Transcription completed in 0.51 seconds. Post-processed line:  I have been experimenting with the distilled whisper models as an alternative to the standard whisper models. While I was able to successfully integrate the distilled models, I am experiencing some issues with transcription quality.

NOTE: I'm using a heavily edited fork. So, consider my observations as related to the underlying libraries rather than WhisperWriter. You can try my fork, if you feel like it. https://github.com/dariox1337/whisper-writer To use distil models with this fork you can simply download a faster-distil-whisper model from HF, and set the folder in "model path".

go-run-jump commented 2 months ago

@dariox1337 I have identified what is responsible for the decreased quality of the distil whisper models. They seem to be more susceptible to defects in the input audio. On my Linux machine, the audio recorded through the sounddevice library plays back faster than real time, skips, and has some flapping noises on top. After I replaced sounddevice with pyaudio, the quality of the distil whisper transcriptions is just what you would expect. No issues. This is probably happening unnoticed for more people, because the original whisper models seem to be very good at handling such audio and you can't select the distil models without changing the code. If it is reducing transcription quality for them, it might be beneficial to replace sounddevice or find out what is responsible for its misbehavior.
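
A quick way to check whether a recording has lost audio is to compare the duration implied by the sample count with the wall-clock time the recording actually took. This is a minimal sketch under stated assumptions: `capture_ratio` and `looks_dropped` are hypothetical helpers, the 5% tolerance is arbitrary, and the wall-clock time would have to be measured by the caller (for example with `time.monotonic()` around the recording).

```python
def capture_ratio(num_samples, sample_rate, wall_seconds):
    """Ratio of captured audio duration to wall-clock recording time.

    A value near 1.0 means the buffer holds about as much audio as was
    spoken. A value well below 1.0 means samples are missing, so the
    recording plays back shorter (and sounds faster or skips).
    """
    if sample_rate <= 0 or wall_seconds <= 0:
        raise ValueError("sample_rate and wall_seconds must be positive")
    return (num_samples / sample_rate) / wall_seconds


def looks_dropped(num_samples, sample_rate, wall_seconds, tolerance=0.05):
    """True if noticeably fewer samples were captured than expected."""
    return capture_ratio(num_samples, sample_rate, wall_seconds) < 1.0 - tolerance


if __name__ == "__main__":
    # Example: 10 seconds of speech at 16 kHz, but only 120000 samples captured.
    print(capture_ratio(120000, 16000, 10.0))
    print(looks_dropped(120000, 16000, 10.0))
```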

go-run-jump commented 2 months ago

@dariox1337 Actually, it seems that this behavior only happens in your fork, the one you suggested merging in #61. Why the distil models don't work for me in the current state of the software remains unclear.

dariox1337 commented 2 months ago

@go-run-jump as I said in the PR, the faster-than-real-time audio might be because the sample rate isn't set correctly somewhere (I don't know where). The skipping and crackling are a mystery to me. I couldn't reproduce either issue.
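
To illustrate the sample-rate hypothesis: if audio is captured at one rate but later interpreted at a higher one, it plays back faster by the ratio of the two rates. The sketch below uses hypothetical helper names, and the 44100 Hz figure is purely an example, not a value observed in this thread. The 260640-sample figure is taken from the log earlier in the thread, which is consistent with 16000 Hz capture.

```python
def playback_speed_factor(capture_rate_hz, assumed_rate_hz):
    """How much faster (>1) or slower (<1) audio plays when the rate is wrong."""
    if capture_rate_hz <= 0 or assumed_rate_hz <= 0:
        raise ValueError("sample rates must be positive")
    return assumed_rate_hz / capture_rate_hz


def apparent_duration(num_samples, assumed_rate_hz):
    """Duration in seconds that a buffer appears to have at the assumed rate."""
    if assumed_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    return num_samples / assumed_rate_hz


if __name__ == "__main__":
    samples = 260640  # sample count from the log earlier in the thread
    print(round(apparent_duration(samples, 16000), 2))  # duration at 16 kHz
    print(round(apparent_duration(samples, 44100), 2))  # same buffer read as 44.1 kHz
    print(round(playback_speed_factor(16000, 44100), 2))
```

Logging the sample rate at both the capture call and the point where the buffer is handed to the model would show whether the two ever disagree.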

Anyway, even though SoundDevice worked without issues for me, I rewrote the audio recording code to use PyAudio as well since it was already used for "beep on completion." The code is in the main branch of my fork.
