EarningsCall closed this issue 1 month ago
Output timestamps don't make any sense. The values are in the hundreds of thousands. This isn't seconds or milliseconds.
The input WAV file is mono with a 16 kHz sampling rate. It is one minute (60 seconds) long.
Steps to reproduce the behavior:
$ ipython
Python 3.9.19 (main, Aug 20 2024, 11:09:40)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.18.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
   ...: SAMPLING_RATE = 16000
   ...:
   ...: import torch
   ...: torch.set_num_threads(1)
   ...:
   ...: from IPython.display import Audio
   ...: from pprint import pprint
   ...: # download example

In [2]: torch.hub.download_url_to_file('https://models.silero.ai/vad_models/en.wav', 'en_example.wav')
100%|████████████████████| 1.83M/1.83M [00:01<00:00, 1.37MB/s]

In [3]:
   ...:
   ...: USE_PIP = True # download model using pip package or torch.hub
   ...: USE_ONNX = False # change this to True if you want to test onnx model
   ...: if USE_ONNX:
   ...:     !pip install -q onnxruntime
   ...: if USE_PIP:
   ...:     !pip install -q silero-vad
   ...:     from silero_vad import (load_silero_vad,
   ...:                             read_audio,
   ...:                             get_speech_timestamps,
   ...:                             save_audio,
   ...:                             VADIterator,
   ...:                             collect_chunks)
   ...:     model = load_silero_vad(onnx=USE_ONNX)
   ...: else:
   ...:     model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
   ...:                                   model='silero_vad',
   ...:                                   force_reload=True,
   ...:                                   onnx=USE_ONNX)
   ...:
   ...:     (get_speech_timestamps,
   ...:      save_audio,
   ...:      read_audio,
   ...:      VADIterator,
   ...:      collect_chunks) = utils
   ...:
[clipped]/.pyenv/versions/3.9.19/lib/python3.9/site-packages/IPython/core/interactiveshell.py:2646: UserWarning: You executed the system command !pip which may not work as expected. Try the IPython magic %pip instead.
  warnings.warn(

In [4]: wav = read_audio('en_example.wav', sampling_rate=SAMPLING_RATE)
   ...: # get speech timestamps from full audio file
   ...: speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)
   ...: pprint(speech_timestamps)
[{'end': 33248, 'start': 32},
 {'end': 77792, 'start': 42528},
 {'end': 109536, 'start': 79392},
 {'end': 214496, 'start': 149024},
 {'end': 243168, 'start': 216608},
 {'end': 253408, 'start': 245280},
 {'end': 286688, 'start': 260640},
 {'end': 313824, 'start': 293920},
 {'end': 602080, 'start': 325152},
 {'end': 622048, 'start': 607264},
 {'end': 693216, 'start': 638496},
 {'end': 713184, 'start': 697888},
 {'end': 749536, 'start': 720416},
 {'end': 799200, 'start': 781344},
 {'end': 855008, 'start': 817184},
 {'end': 960000, 'start': 856608}]
silero-vad version is 5.1.2
$ ffprobe en_example.wav
ffprobe version 7.1 Copyright (c) 2007-2024 the FFmpeg developers
  built with Apple clang version 16.0.0 (clang-1600.0.26.3)
  configuration: --prefix=/opt/homebrew/Cellar/ffmpeg/7.1 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags='-Wl,-ld_classic' --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libaribb24 --enable-libbluray --enable-libdav1d --enable-libharfbuzz --enable-libjxl --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-videotoolbox --enable-audiotoolbox --enable-neon
  libavutil      59. 39.100 / 59. 39.100
  libavcodec     61. 19.100 / 61. 19.100
  libavformat    61.  7.100 / 61.  7.100
  libavdevice    61.  3.100 / 61.  3.100
  libavfilter    10.  4.100 / 10.  4.100
  libswscale      8.  3.100 /  8.  3.100
  libswresample   5.  3.100 /  5.  3.100
  libpostproc    58.  3.100 / 58.  3.100
Input #0, wav, from 'en_example.wav':
  Duration: 00:01:00.00, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s
OK, after re-reading the documentation I just realized the values are sample indices, not seconds. Sorry.
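For anyone else who lands here: the numbers line up once you divide by the sampling rate. The last `end` value, 960000, is exactly 60 s × 16000 samples/s. Below is a minimal sketch of the conversion in plain Python. `samples_to_seconds` is a hypothetical helper written for illustration, not part of silero-vad, and the list holds only the first and last entries of the output above.

```python
SAMPLING_RATE = 16000  # Hz, same value used in the session above

# First and last entries from the reported output; values are sample indices.
speech_timestamps = [
    {'start': 32, 'end': 33248},
    {'start': 856608, 'end': 960000},
]


def samples_to_seconds(timestamps, sampling_rate):
    """Hypothetical helper: convert start/end sample indices to seconds."""
    return [
        {'start': ts['start'] / sampling_rate, 'end': ts['end'] / sampling_rate}
        for ts in timestamps
    ]


in_seconds = samples_to_seconds(speech_timestamps, SAMPLING_RATE)
for ts in in_seconds:
    print(f"{ts['start']:.3f}s -> {ts['end']:.3f}s")
```

As far as I can tell, `get_speech_timestamps` also accepts a `return_seconds=True` argument that does this conversion for you, but check that against the docs for your installed version.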