m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

Silence with background noise causing transcript to misalign #297

Open · jrappeneker opened 1 year ago

jrappeneker commented 1 year ago

Another small issue: sometimes the transcriptions are all shifted significantly forward in time, so that the transcribed text appears seconds before the corresponding speech. This usually corrects itself later in the transcript.

[Screenshot 2023-05-31 at 10:31:14: waveform with the transcript misaligned relative to the speech]

In this screenshot, the word "I'm" appears in the transcript at the position displayed at the bottom of the screen, but in the audio it is not spoken until the highlighted section on the right. Before that, there is only low background noise. This seems to happen mostly in moments of no speech. I wonder if the way the audio is cut up before processing is causing this.

sorgfresser commented 1 year ago

That's really interesting! I've experienced this myself but cannot find the waveform that produced it. It would be interesting to know whether the VAD segment is correct or whether it's already off. If the VAD is the issue, it would be nice to know whether the offset is caused by the merge_chunks operation or already introduced by pyannote. Could you give me an example that reproduces this behaviour, or check the things mentioned above?
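For anyone who wants to check this, here is a rough diagnostic sketch based on whisperX's internal `whisperx.vad` module as it looks at the time of writing (the module layout may change between versions). Each merged chunk keeps the raw pyannote speech turns it was built from, so both stages can be compared against the waveform:

```python
import torch
import whisperx
from whisperx.vad import load_vad_model, merge_chunks

SAMPLE_RATE = 16000  # whisperX resamples all input audio to 16 kHz

# Run only the VAD stage on the problem clip.
audio = whisperx.load_audio("problem_clip.wav")
vad_model = load_vad_model(torch.device("cpu"))
vad_result = vad_model({
    "waveform": torch.from_numpy(audio).unsqueeze(0),
    "sample_rate": SAMPLE_RATE,
})

# merge_chunks binarizes the raw VAD output and merges adjacent speech
# turns into chunks of at most chunk_size seconds (30 s is the default).
for chunk in merge_chunks(vad_result, 30):
    print(f"merged chunk {chunk['start']:7.2f} - {chunk['end']:7.2f}")
    for start, end in chunk["segments"]:
        # raw pyannote speech turns that were merged into this chunk
        print(f"    speech turn {start:7.2f} - {end:7.2f}")
```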

hoonlight commented 1 year ago

I experienced the same thing, with audio processed with demucs.

jrappeneker commented 1 year ago

@sorgfresser I do have an example I can share with you, but it's speech data from a research project that needs to remain anonymous, so I cannot share it publicly. Would it be okay for me to email it to you?

RimaAisulu commented 1 year ago

@jrappeneker btw, what's the tool you used for the above visualization?

jrappeneker commented 1 year ago

@RimaAisulu This is Praat, an excellent tool for doing speech analysis.

I wrote a brief script to convert the JSON output from whisperX into a Praat-readable file so that the output can be checked visually.
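The script itself isn't attached, but a minimal sketch of such a conversion could look like this, assuming whisperX's usual JSON layout (a top-level "segments" list with "start", "end", and "text" per entry). Praat interval tiers have to cover the whole time axis without gaps, so silence between segments is written as empty intervals:

```python
import json

def whisperx_json_to_textgrid(json_path: str, textgrid_path: str) -> None:
    """Write a single-tier Praat TextGrid from a whisperX JSON result."""
    with open(json_path, encoding="utf-8") as f:
        segments = json.load(f)["segments"]

    xmax = max(seg["end"] for seg in segments)

    # Praat interval tiers must tile [0, xmax] without gaps, so pad the
    # stretches between speech segments with empty intervals.
    intervals, cursor = [], 0.0
    for seg in segments:
        if seg["start"] > cursor:
            intervals.append((cursor, seg["start"], ""))
        intervals.append((seg["start"], seg["end"], seg["text"].strip()))
        cursor = seg["end"]

    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        '        name = "transcript"',
        "        xmin = 0",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (start, end, text) in enumerate(intervals, start=1):
        escaped = text.replace('"', '""')  # Praat doubles embedded quotes
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]

    with open(textgrid_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

whisperx_json_to_textgrid("transcript.json", "transcript.TextGrid")
```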

sorgfresser commented 1 year ago

Hey @jrappeneker, I can certainly take a look (even though it might take some days). You can reach me at simon.sorg@student.hpi.de. Just send me an NDA or something similar beforehand if you need me to sign one.

hammad26 commented 10 months ago

I am also facing a similar issue with multi-channel audio. I save each channel separately, transcribe them one by one, and then merge the two responses based on the start time of each segment. There I observed the same behaviour: the first speaker was listening to the second speaker from 9s to 15s and then spoke from 16s to 22s, but the transcription shows the first speaker speaking from 8s to 22s. The silent part of the channel is folded into the segment's time span, which destroys the order in which the speakers spoke and undermines the whole point of multi-channel transcription.
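For reference, a hypothetical sketch of the merge step described above, assuming each per-channel whisperX result is a dict with a "segments" list whose entries carry "start", "end", and "text":

```python
def merge_channels(results: dict[str, dict]) -> list[dict]:
    """Interleave segments from several channels, ordered by start time."""
    merged = [
        {
            "speaker": speaker,
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip(),
        }
        for speaker, result in results.items()
        for seg in result["segments"]
    ]
    # Ordering by segment start only works if each start time is accurate;
    # the bug described above breaks exactly this step, because a segment
    # that swallows leading silence sorts earlier than it should.
    merged.sort(key=lambda seg: seg["start"])
    return merged
```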

Mabenan commented 3 months ago

I'm also facing the same issue. In my case it's an anime snippet where the characters make some loud noises (like "oh", "äh") around 4-5 seconds before one character speaks. The noises are not transcribed, and the segment then ends up completely off, so it is no longer visible when the speech is actually happening.

Could we adjust the length of the segments? Maybe that would help; see the sketch below.
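Recent whisperX versions appear to expose a chunk_size argument on transcribe() that caps how long the merged VAD chunks can grow (the parameter name may differ between versions). A hedged sketch of trying shorter chunks, so a segment has less silence or noise to drift over:

```python
import whisperx

# Assumed filename; chunk_size defaults to 30 s in the versions I've seen.
model = whisperx.load_model("large-v2", "cuda", compute_type="float16")
audio = whisperx.load_audio("anime_clip.wav")

# Smaller chunks leave less non-speech inside each segment.
result = model.transcribe(audio, batch_size=8, chunk_size=10)
for seg in result["segments"]:
    print(f'{seg["start"]:.2f}-{seg["end"]:.2f}: {seg["text"]}')
```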

rbgreenway commented 1 month ago

Facing the same issue. I have multi-channel audio (phone calls) where each speaker is on a different channel. I'm transcribing each channel independently and then matching up the transcripts based on segment/word timestamps. Those long stretches of silence seem to be causing problems. Also, if a channel starts with more than 30 seconds of silence, it doesn't transcribe at all and just returns an empty transcription. That last issue might be related to the VAD settings. If anyone has addressed this, it would be great to hear how.
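In case it's useful: whisperX lets you pass VAD thresholds through load_model, and lowering the onset threshold makes the VAD more sensitive to quiet speech after long silence. A hedged sketch (the vad_onset/vad_offset names match the API at the time of writing, with defaults around 0.500 and 0.363; the filename is assumed):

```python
import whisperx

audio = whisperx.load_audio("channel_0.wav")

# Lower onset/offset thresholds make pyannote's VAD more sensitive,
# which may help channels that open with long stretches of near-silence.
model = whisperx.load_model(
    "large-v2",
    "cuda",
    compute_type="float16",
    vad_options={"vad_onset": 0.30, "vad_offset": 0.25},
)

result = model.transcribe(audio, batch_size=16)
for seg in result["segments"]:
    print(f'{seg["start"]:.2f}-{seg["end"]:.2f}: {seg["text"]}')
```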