m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

Timestamps messed up after splitting the channels with ffmpeg #432

[Open] andrezarzur opened this issue 1 year ago

andrezarzur commented 1 year ago

Hello,

For more precise diarization, I am separating the stereo audio (one person on the left channel, another on the right) into two separate files. The problem is that when I do this and transcribe the files, the timestamps act strangely. For example, a sentence that occurs at 00:58 is timestamped at 00:41. Is this a known issue? When I don't separate the files, the timestamps work perfectly. Also, is what I'm doing a good idea, or is there another way to get accurate diarization with stereo files?
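
For reference, this is roughly the kind of split I mean, using ffmpeg's channelsplit filter from Python (the file names are placeholders):

```python
import subprocess

def split_stereo(input_path: str, left_path: str, right_path: str) -> None:
    """Split a stereo file into two mono files, one per channel, via ffmpeg."""
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", input_path,
            # channelsplit yields one output stream per channel of the stereo layout
            "-filter_complex", "channelsplit=channel_layout=stereo[left][right]",
            "-map", "[left]", left_path,
            "-map", "[right]", right_path,
        ],
        check=True,
    )

split_stereo("call.wav", "left.wav", "right.wav")
```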

ankitagarwal1996 commented 11 months ago

Hi @andrezarzur, I am trying to implement a similar idea. Did you have any success with it? Currently, what I am trying to do is: use WhisperX to transcribe the original audio, use pydub to split the stereo audio into two mono-channel files (one per speaker), then use pydub's silence detector to find the silences in each channel and map speakers to sentences using the timestamps from the WhisperX output.
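
Roughly, the mapping step I have in mind looks like this. It assumes `result` is the WhisperX output for the original stereo file (segment times in seconds), and the silence thresholds are just starting points that would need tuning:

```python
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

audio = AudioSegment.from_file("call.wav")   # stereo source (placeholder name)
left, right = audio.split_to_mono()          # channel 0 / channel 1, one speaker each

def speech_ranges(channel):
    """Non-silent (start_s, end_s) ranges for one channel; thresholds need tuning."""
    ranges = detect_nonsilent(channel, min_silence_len=500,
                              silence_thresh=channel.dBFS - 16)
    return [(start / 1000.0, end / 1000.0) for start, end in ranges]

def overlap(seg, ranges):
    """Seconds of overlap between a WhisperX segment and one channel's speech ranges."""
    return sum(max(0.0, min(seg["end"], e) - max(seg["start"], s)) for s, e in ranges)

left_ranges, right_ranges = speech_ranges(left), speech_ranges(right)

# `result` is assumed to be the WhisperX transcription of the original stereo file
for seg in result["segments"]:
    seg["speaker"] = ("SPEAKER_LEFT"
                      if overlap(seg, left_ranges) >= overlap(seg, right_ranges)
                      else "SPEAKER_RIGHT")
```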

andrezarzur commented 11 months ago

Hello @ankitagarwal1996, I was able to implement what I described previously, although the timestamp error persisted. The error seemed to be occasional and situational, so I ended up ignoring it. From what you described, you are running WhisperX on the stereo file before separating it, right? I was using it like that before, but it seemed to get confused about who the speaker was from time to time; did you also experience this? That is why I ended up splitting the audio before transcribing, but I'm not sure how viable that is, since it technically doubles the amount of processing done for a single file.
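
A minimal sketch of that two-pass approach, assuming the channels were already split into left.wav/right.wav (model size, batch size and file names are placeholders, and the WhisperX calls shown match recent versions but may differ):

```python
import whisperx

device = "cuda"  # or "cpu"
model = whisperx.load_model("large-v2", device, compute_type="float16")

merged = []
for path, speaker in [("left.wav", "SPEAKER_LEFT"), ("right.wav", "SPEAKER_RIGHT")]:
    audio = whisperx.load_audio(path)
    result = model.transcribe(audio, batch_size=16)
    for seg in result["segments"]:
        seg["speaker"] = speaker          # the channel itself tells us who is talking
        merged.append(seg)

# interleave both channels back into one conversation, ordered by start time
merged.sort(key=lambda seg: seg["start"])
```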

ankitagarwal1996 commented 11 months ago

Hey @andrezarzur, I am facing the same issue you described. I will try running the pipeline on the two mono audio files and then merging the results for better alignment (and yes, I agree that it does not sound very feasible given the latency I am currently seeing with my local server setup). One quick question: are you using the wav2vec2 forced-alignment model for better alignment? I am not sure what those results would look like, so I was just curious whether you have tried it.
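
For context, the wav2vec2 forced-alignment step in WhisperX looks roughly like this in recent versions (device, model size and file name are placeholders):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("left.wav")

model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# load a wav2vec2 alignment model for the detected language and force-align the segments
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(
    result["segments"], align_model, metadata, audio, device,
    return_char_alignments=False,
)
# aligned["segments"] now carry word-level "start"/"end" timestamps
```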

andrezarzur commented 11 months ago

Hey @ankitagarwal1996, I wasn't aware of this wav2vec2 model; I was just separating the stereo into two mono files with FFmpeg and transcribing them separately. After some research, it seems to be used to get the timestamps for each spoken phoneme. From what you suggested, I'm guessing there is a way to feed these timestamps back into the Whisper transcription. Is that right? If so, I can't see off the top of my head how that could be achieved, since you wouldn't know which part of the transcript corresponds to which generated timestamp. But I could be interpreting this wrongly.
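
Edit: after reading a bit more, the aligner seems to be given the Whisper transcript as input and aligns that text against the audio, so each timestamp comes back already attached to a specific word rather than having to be matched up afterwards. The aligned output looks roughly like this (illustrative values only; exact key names may differ between WhisperX versions):

```python
aligned_segment = {
    "start": 41.2,
    "end": 44.9,
    "text": " Is this a known issue?",
    "words": [
        {"word": "Is",   "start": 41.20, "end": 41.35, "score": 0.91},
        {"word": "this", "start": 41.36, "end": 41.52, "score": 0.88},
        # ... one entry per word of the transcript
    ],
}
```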

nikola1975 commented 10 months ago

Hello. I am seeing out-of-sync timecodes when using WhisperX as-is. They are 6-8 seconds off in 15-minute videos, for example.

Did you, in the end, have any success improving the timecodes with the approach you mention? Is this the right direction to go in?