Generated subtitles are too long

joshuachough commented 3 months ago

Which OS are you using?

OS: MacOS Sonoma 14.3.1

I am trying to translate Korean audio files and the generation works, but I often find that the generated subtitles are too long. For example, one subtitle that lasts 12 seconds should ideally be split into 6 subtitles that each last 2 seconds.

I saw #158 and tried fiddling with the VAD parameters (running the default implementation using faster-whisper), but it doesn't seem to change anything.

Are there any other parameters I can fiddle with or anything else I can do to achieve this?

jhj0517 commented 3 months ago

Hi @joshuachough, thanks for reporting. I just found out that there was a critical bug where VAD wasn't applied to the audio. It's fixed in #206.

And according to faster-whisper #452, there's no option to tune the length of each segment in the Whisper model.

So the best thing we can do might be to tune the VAD parameters so that faster-whisper receives better pre-processed audio. But it's all about pre-processing, so there is a limit to what it can do.

For your use case, decreasing "Minimum Silence Duration (ms)" and increasing "Speech Padding (ms)" would help.
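
For anyone calling faster-whisper directly rather than through the UI, the same tuning might look roughly like the sketch below. `vad_filter` and `vad_parameters` are arguments of faster-whisper's `WhisperModel.transcribe`. The mapping of the two UI labels to `min_silence_duration_ms` and `speech_pad_ms` is an assumption, and the helper names, model size, and numeric values are placeholders rather than recommendations.

```python
def build_vad_parameters(min_silence_duration_ms=500, speech_pad_ms=600):
    """Build a dict for faster-whisper's `vad_parameters` argument.

    The default values here are illustrative only.
    """
    if min_silence_duration_ms < 0 or speech_pad_ms < 0:
        raise ValueError("durations must be non-negative")
    return {
        "min_silence_duration_ms": min_silence_duration_ms,
        "speech_pad_ms": speech_pad_ms,
    }


def transcribe_with_vad(audio_path, vad_parameters, model_size="large-v2"):
    """Hypothetical wrapper: translate one file with the VAD filter enabled."""
    # Imported lazily so build_vad_parameters works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(
        audio_path,
        task="translate",
        vad_filter=True,
        vad_parameters=vad_parameters,
    )
    # `segments` is a generator; materialize it as (start, end, text) tuples.
    return [(s.start, s.end, s.text) for s in segments]
```

A shorter minimum silence lets the VAD split on briefer pauses, which is the only lever this approach offers for getting shorter segments.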

joshuachough commented 3 months ago

Hi @jhj0517! I tried running it again with VAD and tuning the parameters, but the result did not change. I did some poking around and realized that the run function in the silero_vad module does detect the chunks with voice activity, but then merges all of these chunks back into one combined audio stream using self.collect_chunks. As a result, I keep getting the same long subtitle result as shown in the original issue.

So, I tried rewriting the run function to split the audio into separate clips corresponding to the detected chunks and then run each of these clips through the faster-whisper model. You can see my commit on my fork. This worked!
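
The idea can be sketched as follows. This is not the code from the fork commit, only a minimal illustration of per-chunk transcription. It assumes the chunks are dicts with "start" and "end" sample indices (the shape Silero VAD reports), and `transcribe_fn` is a hypothetical callable standing in for the faster-whisper call.

```python
def transcribe_by_chunk(audio, chunks, transcribe_fn, sampling_rate=16000):
    """Transcribe each speech chunk separately and shift timestamps back.

    audio: sequence of samples (list or NumPy array).
    chunks: dicts with "start"/"end" sample indices into `audio`.
    transcribe_fn: hypothetical callable that takes a clip and returns
        (start_s, end_s, text) tuples relative to the start of that clip.
    """
    results = []
    for chunk in chunks:
        clip = audio[chunk["start"]:chunk["end"]]
        # Clip-relative times must be offset by where the clip began.
        offset_s = chunk["start"] / sampling_rate
        for start_s, end_s, text in transcribe_fn(clip):
            results.append((start_s + offset_s, end_s + offset_s, text))
    return results
```

Because each clip is transcribed on its own, no subtitle can span more than one detected chunk, at the cost of the model losing context across chunk boundaries.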

However, I began to notice another problem. In all of my generated subtitles, I get a weird quantization effect, where each subtitle's duration does not match the exact length of the actual dialogue chunk but is instead rounded to an even number. As you can see in the screenshot above, the first, second, and fourth subtitles are all exactly 2 seconds long even though none of them contains exactly 2 seconds of dialogue. I believe this bug originates in the faster-whisper model itself, but I am unsure whether it is a product of the parameters being fed to the model or a fundamental limitation of the model.

I would love to hear your thoughts!

jhj0517 commented 3 months ago

Hi @joshuachough. I recently found that VAD was incorrectly implemented (there were more bugs beyond the earlier one...), and it's fixed in #216.

VAD works by first removing the non-speech parts from the audio. Then, after transcribing the whole audio with Whisper, it "restores" the original timestamps using the VAD result and the transcription result.

This is how VAD works in faster-whisper, and I followed it in exactly the same way.

For the code, see the restore_speech_timestamps() function.

So if you pull the latest update from this repository, everything should work as intended. Can you check with the latest updates?
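
To illustrate the "restore" step, here is a minimal sketch of mapping a time on the speech-only audio back to the original audio. It is not the actual restore_speech_timestamps() implementation. The function name `restore_time` and its `is_end` flag are hypothetical, and it assumes a non-empty list of chunks given as "start"/"end" sample indices in the original audio.

```python
def restore_time(t, chunks, sampling_rate=16000, is_end=False):
    """Map time `t` (seconds, speech-only audio) to the original audio.

    chunks: non-empty list of dicts with "start"/"end" sample indices
        of each speech region in the original audio, in order.
    is_end: if True, a time exactly on a chunk boundary stays in the
        earlier chunk (for segment ends); otherwise it moves to the
        next chunk (for segment starts).
    """
    elapsed = 0.0  # seconds of speech-only audio covered by earlier chunks
    for chunk in chunks:
        start_s = chunk["start"] / sampling_rate
        duration = (chunk["end"] - chunk["start"]) / sampling_rate
        limit = elapsed + duration
        inside = t <= limit if is_end else t < limit
        if inside:
            # Offset within this chunk, added to the chunk's original start.
            return start_s + (t - elapsed)
        elapsed = limit
    # Beyond all speech: clamp to the end of the last chunk.
    return chunks[-1]["end"] / sampling_rate
```

With speech at 1-2 s and 4-6 s in the original audio, a timestamp of 2.5 s on the speech-only timeline falls 1.5 s into the second chunk, so it maps back to 5.5 s.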

jhj0517 commented 1 week ago

Closing, assuming this is solved. Feel free to reopen!

jhj0517 / Whisper-WebUI

Generated subtitles are too long #205