m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

English Translation in whisperx has much better result than original transcription #831

Open DFanny-5 opened 2 months ago

DFanny-5 commented 2 months ago

I am trying to use task=translate to translate some audio from a non-English language to English, or just transcribe it in its original language. May I ask what the functional logic for translation in WhisperX is? I am surprised to find that the translated transcription is usually much better than the transcription in the non-English language. The issues I found in the original-language transcription:

  1. The speaker diarization results are mixed up in the original language. For example, the translated English transcription correctly assigns seconds 0-2 to speaker 1 and seconds 3-5 to speaker 2, but in the original-language transcription, seconds 0-5 are all marked as the same speaker.

  2. Some sentences are missing. For example, the translated English transcription has four sentences, but the original-language transcription has only three; one sentence does not appear in the original-language transcription even though its translation does appear in the English version.

In my use case, I do not know the audio's language beforehand, so I cannot set the language argument in the transcribe() function.

I applied the English alignment model at the end; is it possible that this is why the speaker diarization results got messed up? May I ask what the translation workflow in WhisperX is? Does it transcribe the audio in its original language and then translate it? If so, why is the English translation in WhisperX so much better than the original transcription?
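For reference, my pipeline is roughly the following (a simplified sketch, not my exact code; the model size, path, and device are placeholders, and the whisperx calls follow the README):

```python
def run_pipeline(audio_path, device="cuda"):
    """Simplified sketch of my current pipeline."""
    import whisperx  # imported lazily so this sketch stands alone

    model = whisperx.load_model("large-v2", device)
    audio = whisperx.load_audio(audio_path)

    # No `language` argument: whisperx auto-detects the language
    # and reports it back in result["language"].
    result = model.transcribe(audio)

    # This is the part I am unsure about: the alignment model is
    # hard-coded to English, even for non-English audio.
    align_model, metadata = whisperx.load_align_model(
        language_code="en", device=device
    )
    return whisperx.align(
        result["segments"], align_model, metadata, audio, device
    )
```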

Thanks to anyone who can give a hint.

debalabbas commented 1 month ago

@DFanny-5, WhisperX is built on top of Whisper, which was trained in a multilingual, multitask fashion, so it performs translation end to end; no intermediate outputs are generated. The poor transcription in the original language is likely due to the smaller amount of data for the X->X transcription task. Here is an image from the original Whisper paper that shows the data distribution:

[image: per-language training data distribution, from the Whisper paper]

You can refer to it to see how much data was used for the language you are working with; most likely it is less than for the English translation task.
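Regarding the diarization mix-up: yes, applying the English alignment model to non-English audio can corrupt the word timestamps that the speaker labels are assigned from. Pass the detected language to the alignment step instead. A minimal sketch, assuming the whisperx API as documented in the README (`load_align_model` takes a `language_code` argument); `detected_language` is a small helper added here for illustration:

```python
def detected_language(result, default="en"):
    # transcribe() stores the auto-detected language code on the
    # result dict; fall back only if detection produced nothing.
    return result.get("language") or default


def align_in_detected_language(result, audio, device="cuda"):
    import whisperx  # lazy import so the sketch stands alone

    # Load the alignment model matching the detected language
    # instead of hard-coding English.
    align_model, metadata = whisperx.load_align_model(
        language_code=detected_language(result), device=device
    )
    return whisperx.align(
        result["segments"], align_model, metadata, audio, device
    )
```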