thomasmol / cog-whisper-diarization

Cog implementation of transcribing + diarization pipeline with Whisper & Pyannote
https://replicate.com/thomasmol/whisper-diarization
151 stars 44 forks

Non-English (Hindi, Arabic) audio failing on cleaning phase #14

Closed. getpaoapps closed this issue 6 days ago

getpaoapps commented 1 month ago

I notice infrequent failures for non-English audio with the error: ('Error Running inference with local model', IndexError('list index out of range')).

Example of failing audio.

Error log:

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil      56. 70.100 / 56. 70.100
libavcodec     58.134.100 / 58.134.100
libavformat    58. 76.100 / 58. 76.100
libavdevice    58. 13.100 / 58. 13.100
libavfilter     7.110.100 /  7.110.100
libswscale      5.  9.100 /  5.  9.100
libswresample   3.  9.100 /  3.  9.100
libpostproc    55.  9.100 / 55.  9.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'temp-1723460479850449730.audio':
Metadata:
major_brand     : isom
minor_version   : 512
compatible_brands: isomiso2avc1mp41
comment         : vid:v14044g50000cqquupfog65q3u02hu40
aigc_info       : {"aigc_label_type": 0}
vid_md5         : 92a875b0942f88047008326ca2b2e5b7
encoder         : Lavf58.76.100
Duration: 00:03:21.13, start: 0.000000, bitrate: 85 kb/s
Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p(tv, bt709), 576x576 [SAR 1:1 DAR 1:1], 46 kb/s, 30 fps, 30 tbr, 15360 tbn, 60 tbc (default)
Metadata:
handler_name    : VideoHandler
vendor_id       : [0][0][0][0]
Stream #0:1(und): Audio: aac (HE-AACv2) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 32 kb/s (default)
Metadata:
handler_name    : SoundHandler
vendor_id       : [0][0][0][0]
Stream mapping:
Stream #0:1 -> #0:0 (aac (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to 'temp-1723460440106808442.wav':
Metadata:
major_brand     : isom
minor_version   : 512
compatible_brands: isomiso2avc1mp41
ICMT            : vid:v14044g50000cqquupfog65q3u02hu40
aigc_info       : {"aigc_label_type": 0}
vid_md5         : 92a875b0942f88047008326ca2b2e5b7
ISFT            : Lavf58.76.100
Stream #0:0(und): Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s (default)
Metadata:
handler_name    : SoundHandler
vendor_id       : [0][0][0][0]
encoder         : Lavc58.134.100 pcm_s16le
size=       0kB time=00:00:00.16 bitrate=   9.0kbits/s speed= 563x
size=    6280kB time=00:03:21.10 bitrate= 255.8kbits/s speed= 735x
video:0kB audio:6280kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.001928%
Starting transcribing
Finished with transcribing, took 1.4529 seconds
Starting diarization
Finished with diarization, took 3.0839 seconds
Starting merging
Finished with merging, took 0.00016642 seconds
Starting cleaning
Traceback (most recent call last):
File "/src/predict.py", line 154, in predict
segments, detected_num_speakers, detected_language = self.speech_to_text(
File "/src/predict.py", line 322, in speech_to_text
"start": segments[0]["start"],
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/.pyenv/versions/3.10.14/lib/python3.10/site-packages/cog/server/worker.py", line 221, in _predict
result = predict(**payload)
File "/src/predict.py", line 175, in predict
raise RuntimeError("Error Running inference with local model", e)
RuntimeError: ('Error Running inference with local model', IndexError('list index out of range'))

thomasmol commented 1 month ago

Interesting. It seems the transcribing actually finished, but there was nothing to transcribe.

Listening to your audio, it seems to be singing plus music rather than clear speech. My guess is that either the VAD (voice activity detection) finds no spoken words or Whisper can't recognize any speech, so nothing gets transcribed and the list of segments ends up empty. The out-of-range error happens because the code tries to read data from the first segment, which does not exist. But I'm unsure if this is exactly the case.
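
To illustrate the suspected failure mode, here is a minimal sketch. It assumes the cleaning step receives an empty list; the function name `first_segment_start` is hypothetical and only mirrors the `segments[0]["start"]` access shown in the traceback, not the actual code in predict.py.

```python
# Hypothetical stand-in for the failing access in the cleaning step.
def first_segment_start(segments):
    # Same indexing pattern as the traceback: segments[0]["start"]
    return segments[0]["start"]


# Assumption: audio with no recognizable speech yields no segments at all.
segments = []

try:
    first_segment_start(segments)
    error_message = None
except IndexError as exc:
    # Indexing an empty list raises IndexError, matching the log above.
    error_message = str(exc)

print(error_message)
```

If the assumption holds, any audio that produces zero segments would hit this path, regardless of language.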

Do you have more similar-sounding audio files, some where it works and some where it does not?

getpaoapps commented 1 month ago

Here is another example. Do you think it makes sense to add a check on the segment count and return empty output when there are none?

thomasmol commented 1 month ago

Yes, good idea! If there are no segments in the output, it should just return an empty array rather than raise an error.
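
A rough sketch of what such a guard could look like, under stated assumptions: `clean_segments` is a hypothetical function, and the `speaker`, `end`, and `text` keys are assumed for illustration (only `start` appears in the traceback). The merging logic is a guess at what a cleaning step might do, not the repository's actual implementation.

```python
def clean_segments(segments):
    """Hypothetical cleaning step: merge consecutive segments by speaker."""
    # Guard: with no segments there is nothing to clean, so return an
    # empty list instead of indexing segments[0] and raising IndexError.
    if not segments:
        return []

    # Copy the first segment so the input list is not mutated.
    current = dict(segments[0])
    output = []

    for seg in segments[1:]:
        if seg["speaker"] == current["speaker"]:
            # Same speaker: extend the current group.
            current["end"] = seg["end"]
            current["text"] += " " + seg["text"]
        else:
            # Speaker changed: close the current group and start a new one.
            output.append(current)
            current = dict(seg)

    output.append(current)
    return output


print(clean_segments([]))
```

With a guard like this, callers would receive an empty result for music-only or silent audio and could decide for themselves whether that counts as an error.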