huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License
3.33k stars 238 forks

Is there a parameter that can keep the numbers in their original language form? #61

Open GrahLnn opened 6 months ago

GrahLnn commented 6 months ago

I am trying to use WhisperX for word alignment.

sanchit-gandhi commented 6 months ago

Hey @GrahLnn - sorry I don't fully understand the issue here. Could you possibly explain what you're trying to do and the problem you're facing? Also, if you have a code snippet to demonstrate this issue, that would be most helpful! Thanks!

GrahLnn commented 6 months ago

Sorry I didn't make it clear. WhisperX can obtain word-level timestamps, but because the dictionary of the alignment model it uses does not contain digits, it cannot provide timestamps for numbers. So my initial thought was whether there is a way to output the numbers directly as English words. Yesterday, though, I saw that the "--timestamps" parameter of insanely-fast-whisper supports word-level timestamps. Will distil-whisper have this option? If we can get word-level timestamps directly, I think it might be better than the approach used by WhisperX.

sanchit-gandhi commented 5 months ago

Thanks for the clarification! Yes, it should be possible to get word-level timestamps by finding the alignment between the cross-attention heads, cf. this notebook: https://huggingface.co/distil-whisper/distil-small.en/discussions/7#6596c0e693dcb56444287696

I'll run the alignment for the distil-whisper models and update the generation configs as required!
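
For reference, a minimal sketch of what requesting word-level timestamps could look like through the Transformers `pipeline` API. It assumes the checkpoint's generation config already carries the alignment heads mentioned above (without them, word-level timestamps are not expected to work). The helper names `words_with_times` and `transcribe_with_word_timestamps` are hypothetical, and the model ID is simply the one from the linked discussion.

```python
def words_with_times(result):
    """Flatten an ASR pipeline result into (word, start, end) tuples.

    Assumes the result has a "chunks" list whose items carry "text" and a
    (start, end) "timestamp" pair, as returned with return_timestamps="word".
    """
    rows = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        rows.append((chunk["text"].strip(), start, end))
    return rows


def transcribe_with_word_timestamps(audio_path, model_id="distil-whisper/distil-small.en"):
    """Hypothetical wrapper: run the ASR pipeline and ask for word timestamps."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path, return_timestamps="word")
```

Usage would be along the lines of `words_with_times(transcribe_with_word_timestamps("sample.wav"))`, where `sample.wav` is a placeholder path.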

orena1 commented 5 months ago

Thanks @sanchit-gandhi, but what about the original question? Is there a way to make sure Whisper outputs numbers spelled out in letters rather than as digit tokens? E.g. if the audio contains someone saying "I have 10 dollars", is there a way to make sure the model outputs " I have ten dollars"?

Thanks

ruimaia commented 5 months ago

Hi @orena1. You can achieve that by suppressing numerical tokens. See here how to do it.
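
A rough sketch of the token-suppression idea, in case it helps: collect every vocabulary entry that contains a digit and pass those IDs as `suppress_tokens` at generation time. This is an illustration under assumptions, not necessarily the exact method from the link above. `numeric_token_ids` and `generate_without_digits` are hypothetical helper names, and the sketch assumes a Hugging Face Whisper model and tokenizer where `tokenizer.get_vocab()` maps token strings to IDs.

```python
def numeric_token_ids(vocab):
    """Return sorted ids of vocabulary entries that contain an ASCII digit.

    `vocab` maps token strings to integer ids, for example the result of
    `tokenizer.get_vocab()` on a Hugging Face tokenizer.
    """
    ids = []
    for token, token_id in vocab.items():
        # Skip special tokens such as "<|0.00|>" so timestamps are unaffected.
        if token.startswith("<|") and token.endswith("|>"):
            continue
        if any(ch in "0123456789" for ch in token):
            ids.append(token_id)
    return sorted(ids)


def generate_without_digits(model, tokenizer, input_features):
    """Hypothetical wrapper: extend the model's suppress list, then generate."""
    # Keep whatever the generation config already suppresses.
    existing = list(model.generation_config.suppress_tokens or [])
    digit_ids = numeric_token_ids(tokenizer.get_vocab())
    suppress = sorted(set(existing) | set(digit_ids))
    return model.generate(input_features, suppress_tokens=suppress)
```

Note that this blocks every digit-bearing token, so years, times and similar items can no longer be written with digits either. How the model phrases them instead is up to the model.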