snakers4 / silero-vad

Silero VAD: pre-trained enterprise-grade Voice Activity Detector
MIT License

Bug report - Timestamps are nonsensical #551

Closed. EarningsCall closed this issue 1 month ago.

EarningsCall commented 1 month ago

🐛 Bug

The output timestamps don't make any sense. The values run into the hundreds of thousands, so they can't be seconds or milliseconds.

The input WAV file is mono with a 16 kHz sampling rate and is one minute (60 seconds) long.

To Reproduce

Steps to reproduce the behavior:

$ ipython
Python 3.9.19 (main, Aug 20 2024, 11:09:40)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.18.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
   ...: SAMPLING_RATE = 16000
   ...:
   ...: import torch
   ...: torch.set_num_threads(1)
   ...:
   ...: from IPython.display import Audio
   ...: from pprint import pprint
   ...: # download example

In [2]: torch.hub.download_url_to_file('https://models.silero.ai/vad_models/en.wav', 'en_example.wav')
100%|██████████| 1.83M/1.83M [00:01<00:00, 1.37MB/s]

In [3]:
   ...:
   ...: USE_PIP = True # download model using pip package or torch.hub
   ...: USE_ONNX = False # change this to True if you want to test onnx model
   ...: if USE_ONNX:
   ...:   !pip install -q onnxruntime
   ...: if USE_PIP:
   ...:   !pip install -q silero-vad
   ...:   from silero_vad import (load_silero_vad,
   ...:                           read_audio,
   ...:                           get_speech_timestamps,
   ...:                           save_audio,
   ...:                           VADIterator,
   ...:                           collect_chunks)
   ...:   model = load_silero_vad(onnx=USE_ONNX)
   ...: else:
   ...:   model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
   ...:                                 model='silero_vad',
   ...:                                 force_reload=True,
   ...:                                 onnx=USE_ONNX)
   ...:
   ...:   (get_speech_timestamps,
   ...:   save_audio,
   ...:   read_audio,
   ...:   VADIterator,
   ...:   collect_chunks) = utils
   ...:
[clipped]/.pyenv/versions/3.9.19/lib/python3.9/site-packages/IPython/core/interactiveshell.py:2646: UserWarning: You executed the system command !pip which may not work as expected. Try the IPython magic %pip instead.
  warnings.warn(

In [4]: wav = read_audio('en_example.wav', sampling_rate=SAMPLING_RATE)
   ...: # get speech timestamps from full audio file
   ...: speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)
   ...: pprint(speech_timestamps)
[{'end': 33248, 'start': 32},
 {'end': 77792, 'start': 42528},
 {'end': 109536, 'start': 79392},
 {'end': 214496, 'start': 149024},
 {'end': 243168, 'start': 216608},
 {'end': 253408, 'start': 245280},
 {'end': 286688, 'start': 260640},
 {'end': 313824, 'start': 293920},
 {'end': 602080, 'start': 325152},
 {'end': 622048, 'start': 607264},
 {'end': 693216, 'start': 638496},
 {'end': 713184, 'start': 697888},
 {'end': 749536, 'start': 720416},
 {'end': 799200, 'start': 781344},
 {'end': 855008, 'start': 817184},
 {'end': 960000, 'start': 856608}]

silero-vad version is 5.1.2

$ ffprobe en_example.wav
ffprobe version 7.1 Copyright (c) 2007-2024 the FFmpeg developers
  built with Apple clang version 16.0.0 (clang-1600.0.26.3)
  configuration: --prefix=/opt/homebrew/Cellar/ffmpeg/7.1 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags='-Wl,-ld_classic' --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libaribb24 --enable-libbluray --enable-libdav1d --enable-libharfbuzz --enable-libjxl --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-videotoolbox --enable-audiotoolbox --enable-neon
  libavutil      59. 39.100 / 59. 39.100
  libavcodec     61. 19.100 / 61. 19.100
  libavformat    61.  7.100 / 61.  7.100
  libavdevice    61.  3.100 / 61.  3.100
  libavfilter    10.  4.100 / 10.  4.100
  libswscale      8.  3.100 /  8.  3.100
  libswresample   5.  3.100 /  5.  3.100
  libpostproc    58.  3.100 / 58.  3.100
Input #0, wav, from 'en_example.wav':
  Duration: 00:01:00.00, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s

EarningsCall commented 1 month ago

OK, after re-reading the documentation I realized the values are sample indices, not seconds. Sorry for the noise.
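
For anyone who lands here with the same confusion, here is a minimal sketch of the conversion, assuming the timestamps are sample indices at the 16 kHz rate used above. The helper name `samples_to_seconds` is hypothetical and not part of silero-vad. The two entries are copied from the first and last rows of the output in the report.

```python
SAMPLING_RATE = 16000  # Hz, same value as in the session above

# First and last entries copied from the reported output (sample indices).
speech_timestamps = [
    {'end': 33248, 'start': 32},
    {'end': 960000, 'start': 856608},
]


def samples_to_seconds(timestamps, sampling_rate):
    """Hypothetical helper: convert sample-index timestamps to seconds."""
    return [
        {
            'start': ts['start'] / sampling_rate,
            'end': ts['end'] / sampling_rate,
        }
        for ts in timestamps
    ]


in_seconds = samples_to_seconds(speech_timestamps, SAMPLING_RATE)

# The final 'end' value is 960000 samples, which at 16000 samples per second
# is 60.0 seconds, matching the one-minute duration reported by ffprobe.
print(in_seconds[-1]['end'])
```

The silero-vad README also describes a `return_seconds=True` argument to `get_speech_timestamps` that returns seconds directly. It is not shown here, so check it against your installed version.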