m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Vocabulary files #324

Open sorgfresser opened 1 year ago

sorgfresser commented 1 year ago

Often there are very specific names or brands that should be mentioned correctly in a transcript. Right now, there is no nice way to do this. Maybe we could add this in the future? I'd like to know ideas on how to enable this. My current ones are

I have never worked with them until now, so I do not know the usual way of handling this. Maybe someone else knows.

DDematto commented 1 year ago

I am working on a similar problem: I am trying to come up with a method of aligning transcription results with song lyrics. I have tried a few alignment algorithms and scores, such as Levenshtein distance and Metaphone scores (phonetic comparison). It would be a really cool feature/enhancement if the transcription segments could be aligned with our own custom vocabulary on top of the model.
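For illustration, here is a minimal sketch of the Levenshtein-based matching idea mentioned above, applied to snapping transcript words onto a custom vocabulary. It is not part of WhisperX: the function names `levenshtein` and `correct_words` and the `max_ratio` threshold are assumptions made up for this example, and the threshold would need tuning on real data.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def correct_words(words, vocabulary, max_ratio=0.34):
    """Replace a word with its closest vocabulary term when the normalized
    edit distance is at most max_ratio (a hypothetical threshold)."""
    corrected = []
    for word in words:
        best, best_ratio = word, max_ratio
        for term in vocabulary:
            distance = levenshtein(word.lower(), term.lower())
            ratio = distance / max(len(word), len(term), 1)
            if ratio <= best_ratio:
                best, best_ratio = term, ratio
        corrected.append(best)
    return corrected


if __name__ == "__main__":
    print(correct_words(["Wisper", "transcribes", "audio"], ["Whisper"]))
```

Normalizing by the longer string keeps short words from being replaced too eagerly. A phonetic score such as Metaphone could be swapped in for `levenshtein` without changing the surrounding logic.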

LeeHaha314 commented 1 year ago

Hi guys, I am dealing with the same issue now. I tried passing my vocab list through the official initial_prompt parameter, but it didn't work as expected. Have you made any progress on this issue, or do you have any new ideas? @sorgfresser @DDematto For Chinese, methods like Levenshtein distance are unsuitable, since similar pronunciations can correspond to totally different characters. Phonetic matching might be a solution, but combining it with a local personal vocabulary also takes time to figure out. Since some of this processing already happens inside the Whisper transcribe pipeline, I think an official solution would be better and more elegant than postprocessing.
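For reference, a minimal sketch of how a vocab list might be passed through the initial_prompt parameter mentioned above, using the openai-whisper package rather than WhisperX. The helper names `build_initial_prompt` and `transcribe_with_vocabulary` and the `max_chars` cap are assumptions for this example. As far as I know, Whisper keeps only a limited window of prompt tokens, so a long list may be partly ignored, and the prompt biases decoding rather than forcing any spelling.

```python
def build_initial_prompt(vocabulary, max_chars=600):
    """Join vocabulary terms into one comma-separated prompt string.

    max_chars is a hypothetical safety cap chosen for this sketch; Whisper
    itself limits the prompt by tokens, not characters.
    """
    terms = []
    length = 0
    for term in vocabulary:
        term = term.strip()
        if not term or term in terms:
            continue  # skip blanks and duplicates
        added = len(term) + (2 if terms else 0)  # account for the ", " separator
        if length + added > max_chars:
            break
        terms.append(term)
        length += added
    return ", ".join(terms)


def transcribe_with_vocabulary(audio_path, vocabulary, model_name="base"):
    """Transcribe with openai-whisper, biasing decoding via initial_prompt."""
    import whisper  # imported lazily so the helper above works without it

    model = whisper.load_model(model_name)
    prompt = build_initial_prompt(vocabulary)
    return model.transcribe(audio_path, initial_prompt=prompt)


if __name__ == "__main__":
    print(build_initial_prompt(["WhisperX", " pyannote ", "", "WhisperX"]))
```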

Update: I now use another tokenizer that performs better on Chinese text, plus a tool that computes the phonetic distance between two words based on pinyin, to match the source text against my vocab list. I wrote the postprocessing logic myself. It works to some extent, with some limits. I also need to maintain a whitelist manually to skip some similar-sounding words in my vocab list. Looking forward to other inspiring ideas lol
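A rough sketch of what such a pinyin-based postprocessing step could look like is below. It is not the code described in the update: the tokenizer and phonetic-distance tool are not named in this thread, so this example assumes the text is already tokenized, replaces the distance measure with plain equality of toneless pinyin, and uses a tiny hardcoded table (`TOY_PINYIN`) where a real setup would call a pinyin library. All names here are hypothetical.

```python
# Hypothetical toy table; a real setup would use a pinyin library instead.
TOY_PINYIN = {
    "李": "li", "里": "li",
    "华": "hua", "化": "hua",
    "你": "ni", "好": "hao",
}


def pinyin_key(text, table=TOY_PINYIN):
    """Map each character to toneless pinyin; unknown characters map to themselves."""
    return tuple(table.get(ch, ch) for ch in text)


def apply_vocabulary(tokens, vocabulary, skip=frozenset()):
    """Replace tokens that sound exactly like a vocabulary term.

    tokens: output of a word segmenter (assumed to be tokenized already).
    skip:   tokens that must never be replaced, even if they sound alike
            (the manually maintained whitelist).
    """
    by_key = {pinyin_key(term): term for term in vocabulary}
    result = []
    for token in tokens:
        replacement = by_key.get(pinyin_key(token))
        if replacement is not None and token not in skip:
            result.append(replacement)
        else:
            result.append(token)
    return result


if __name__ == "__main__":
    print(apply_vocabulary(["你好", "里化"], ["李华"]))
```

Exact key equality only catches homophones. A distance over the pinyin syllables would also catch near-misses, at the cost of more false positives, which is presumably where the whitelist becomes necessary.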

kurianbenoy-sentient commented 9 months ago

@LeeHaha314 can you share which tokenizer you used with WhisperX that helped you get good results?