m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

English Translation in whisperx has much better result than original transcription #831

Open DFanny-5 opened 2 months ago

DFanny-5 commented 2 months ago

I am trying to use task=translate to translate some audio from a non-English language to English, or just transcribe it in its original language. May I ask what the functional logic for translation in WhisperX is? I am surprised to find that the translated transcription is usually much better than the transcription in the non-English language. The issues I found in the original-language transcription:

  1. The speaker diarization results are mixed up in the original language. For example, the translated English transcription correctly assigns seconds 0-2 to speaker 1 and seconds 3-5 to speaker 2, but in the original-language transcription, seconds 0-5 are all marked as the same speaker.

  2. Some sentences are missing. For example, the translated English transcription has four sentences, but the original-language transcription has only three; one sentence does not appear in the original-language transcription even though its translation does appear in the English version.

In my use case, I do not know the audio's language beforehand, so I cannot set the language argument in the transcribe() function.

I applied the English alignment model at the end; is it possible that this is why the speaker diarization results got messed up? May I ask what the translation workflow in WhisperX is? Does it transcribe the audio in its original language and then translate it? If so, why is the English translation in WhisperX so much better than the original transcription?
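For reference, my pipeline is roughly the following (a simplified sketch, not my exact code; the model size, path, and device are placeholders, and the whisperx calls follow the README):

```python
def run_pipeline(audio_path, device="cuda"):
    """Simplified sketch of my current pipeline."""
    import whisperx  # imported lazily so this sketch stands alone

    model = whisperx.load_model("large-v2", device)
    audio = whisperx.load_audio(audio_path)

    # No `language` argument: whisperx auto-detects the language
    # and reports it back in result["language"].
    result = model.transcribe(audio)

    # This is the part I am unsure about: the alignment model is
    # hard-coded to English, even for non-English audio.
    align_model, metadata = whisperx.load_align_model(
        language_code="en", device=device
    )
    return whisperx.align(
        result["segments"], align_model, metadata, audio, device
    )
```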

Thanks to anyone who can give a hint.

debalabbas commented 1 month ago

@DFanny-5, WhisperX is built on top of Whisper, which was trained in a multilingual, multitask fashion, so it performs translation end to end; no intermediate outputs are generated. The poor transcription in the original language is likely due to the smaller amount of data for the X->X transcription task. Here is an image from the original Whisper paper that shows the data distribution:

[image: per-language training data distribution, from the Whisper paper]

You can refer to it to see how much data was used for the language you are working with; most likely it is less than for the English translation task.
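Regarding the diarization mix-up: yes, applying the English alignment model to non-English audio can corrupt the word timestamps that the speaker labels are assigned from. Pass the detected language to the alignment step instead. A minimal sketch, assuming the whisperx API as documented in the README (`load_align_model` takes a `language_code` argument); `detected_language` is a small helper added here for illustration:

```python
def detected_language(result, default="en"):
    # transcribe() stores the auto-detected language code on the
    # result dict; fall back only if detection produced nothing.
    return result.get("language") or default


def align_in_detected_language(result, audio, device="cuda"):
    import whisperx  # lazy import so the sketch stands alone

    # Load the alignment model matching the detected language
    # instead of hard-coding English.
    align_model, metadata = whisperx.load_align_model(
        language_code=detected_language(result), device=device
    )
    return whisperx.align(
        result["segments"], align_model, metadata, audio, device
    )
```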