snakers4 / silero-vad

Silero VAD: pre-trained enterprise-grade Voice Activity Detector
MIT License
4.41k stars 432 forks

❓ can save_audio keep original file's channels number ? #296

Closed sboudouk closed 1 year ago

sboudouk commented 1 year ago

Hi.

First of all, thanks for Silero VAD; it's a really amazing project and a pleasure to work with.

I'm using Silero mainly for its VAD, on a stereo file (2 channels). When getting my segments with save_audio, it seems that I'm losing channels and end up with a mono file.

Is it possible to keep the file's channel count without using a third-party lib to split the channels myself?

Thanks for the information; I searched but found nothing relevant to this.

snakers4 commented 1 year ago

I'm using Silero mainly for its VAD, on a stereo file (2 channels). When getting my segments with save_audio, it seems that I'm losing channels and end up with a mono file.

We do indeed average the stereo channels. But you can take the returned timestamps and save the audio yourself using them.

This use case seems rare; a PR would be appreciated, but I believe it would complicate the code unnecessarily.

The correct solution, I believe, would be to write a simple script on top of the provided utils, i.e. some small middleware, which would apply VAD to each track separately and then merge them somehow (e.g. apply AND or OR to timestamps of both tracks).
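The "merge them somehow" step above can be sketched with plain interval arithmetic. This is a hypothetical helper (not part of the silero-vad utils): it takes the per-channel timestamp lists returned by get_speech_timestamps and computes their OR (union), so a segment counts as speech if either channel contains speech.

```python
# Hypothetical sketch: merge per-channel VAD timestamps with an OR
# (union) policy. `left_ts` and `right_ts` are lists of
# {"start": sample, "end": sample} dicts, as get_speech_timestamps
# returns for each channel run separately.

def merge_timestamps_or(left_ts, right_ts):
    """Union of speech intervals from two channels."""
    intervals = sorted((t["start"], t["end"]) for t in left_ts + right_ts)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [{"start": s, "end": e} for s, e in merged]
```

An AND policy (speech only when both channels agree) would instead intersect the interval lists; which one fits depends on whether the two tracks carry different speakers.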

sboudouk commented 1 year ago

Thanks for the help @snakers4

I don't know if my question was very clear, but I do not want to apply VAD on each channel separately; the current VAD behaviour works great for me.

I just want save_audio to return my audio segmented, but as stereo (because that's what I sent to it). But it looks like that is not possible at the moment either.

I will use another audio library to split my audio as I want, as you suggested.

Is it possible to convert the sample indices returned by get_speech_timestamps to precise milliseconds instead of seconds? Or can I use the sample values directly to segment my audio using another lib like pydub or wave?

Sorry I'm pretty new to the audio world so those questions might be stupid.

Thanks for your help and the valuable work.

snakers4 commented 1 year ago

Is it possible to convert the sample indices returned by get_speech_timestamps to precise milliseconds instead of seconds?

Just divide the sample values by your sample rate to get seconds (and multiply by 1000 for milliseconds), and that's it.
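In code, the conversion above is one line; samples_to_ms here is just an illustrative helper name, assuming the same 16 kHz rate that was passed to get_speech_timestamps:

```python
SAMPLING_RATE = 16000  # the rate the audio was read at for VAD

def samples_to_ms(sample_index, sampling_rate=SAMPLING_RATE):
    # One sample lasts 1/sampling_rate seconds, i.e. 1000/sampling_rate ms.
    return sample_index * 1000 / sampling_rate

# A timestamp of 32000 samples at 16 kHz is 2.0 s, i.e. 2000 ms.
```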

Or can I use the samples values directly to segment my audios using another lib like pydub or wave ?

I can tell you how I would do it, for a wav file for example. I would get the VAD timestamps, read the audio into an array shaped (length, 2 channels), and then just slice that stereo array using the sample timings from the VAD. The number of samples literally is the length of the array.
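A minimal sketch of that slicing step, using a plain NumPy array to stand in for the decoded wav (the timestamp values here are made up for illustration, and are assumed to already be in samples at the file's own sample rate):

```python
import numpy as np

# Stand-in for the decoded stereo file: shape (num_samples, 2 channels).
audio = np.zeros((100_000, 2))

# Example VAD output, in samples at the file's own sample rate.
speech_timestamps = [
    {"start": 16_000, "end": 48_000},
    {"start": 60_000, "end": 80_000},
]

# Slice the stereo array per segment, then concatenate the speech parts.
segments = [audio[ts["start"]:ts["end"], :] for ts in speech_timestamps]
speech_only = np.concatenate(segments, axis=0)  # still (N, 2): stereo kept
```

Because the slicing only touches the time axis, the channel dimension passes through untouched, which is exactly what keeps the output stereo.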

sboudouk commented 1 year ago

Perfect, that's a very useful hint.

Thanks for your help @snakers4 !

Simon-chai commented 1 year ago

I'm using Silero mainly for its VAD, on a stereo file (2 channels). When getting my segments with save_audio, it seems that I'm losing channels and end up with a mono file.

We do indeed average the stereo channels. But you can take the returned timestamps and save the audio yourself using them.

This use case seems rare; a PR would be appreciated, but I believe it would complicate the code unnecessarily.

The correct solution, I believe, would be to write a simple script on top of the provided utils, i.e. some small middleware, which would apply VAD to each track separately and then merge them somehow (e.g. apply AND or OR to timestamps of both tracks).

Applying VAD to each track separately is exactly what I did in the first place, but I found that it simply doubles the processing time. So here I am, hoping to find out whether there is a cleverer way to do it; otherwise I need to double my machines.

phineas-pta commented 11 months ago

here is my simple conversion:

import torch
import torchaudio

SAMPLING_RATE = 16000  # rate the VAD model runs at

# Load the model; from the returned utils we only need
# get_speech_timestamps and read_audio.
MODEL, (get_speech_timestamps, _, read_audio, _, _) = torch.hub.load(
    repo_or_dir="snakers4/silero-vad", model="silero_vad", onnx=False
)

# Run VAD on a mono 16 kHz copy of the file.
wav = read_audio("INPUT PATH", sampling_rate=SAMPLING_RATE)
speech_timestamps = get_speech_timestamps(wav, MODEL, sampling_rate=SAMPLING_RATE)

# Reload the original file with its native channels and sample rate,
# rescale the 16 kHz sample timestamps to the native rate, and slice.
waveform, sample_rate = torchaudio.load("INPUT PATH")
ratio = sample_rate / SAMPLING_RATE
cut_waveform = torch.cat([
    waveform[:, int(el["start"] * ratio) : int(el["end"] * ratio)]
    for el in speech_timestamps
], dim=1)
torchaudio.save("OUTPUT PATH", cut_waveform, sample_rate)