Closed Coconut059 closed 1 week ago
Thanks for your comment! We will add these datasets to our validation for more stable future models.
Hi! Can you tell me why the voice activity detection module performs so poorly? Does its performance depend heavily on the dataset?
Tuning the parameters for your dataset is necessary. If your audio is quiet overall, try a lower threshold and a longer min_silence_samples; otherwise, try a higher threshold and a shorter min_silence_samples.
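A minimal sketch of such a parameter sweep, in case it helps. It assumes `get_speech_timestamps` accepts `threshold` and `min_silence_duration_ms` keyword arguments, as recent silero-vad releases do; the comment above names `min_silence_samples`, so the exact keyword may differ in your installed version. The helper name `sweep_vad_params` and the grid values are hypothetical, chosen only for illustration.

```python
from itertools import product


def sweep_vad_params(wav, model, get_speech_timestamps, sampling_rate=16000,
                     thresholds=(0.3, 0.5, 0.7),
                     min_silences_ms=(100, 300, 500)):
    """Run the VAD once per (threshold, min-silence) pair.

    `get_speech_timestamps` is passed in explicitly (e.g. the function returned
    in `utils` by torch.hub.load) so this helper has no hard dependency.
    Returns one dict per setting with the total detected speech in seconds.
    """
    results = []
    for threshold, min_silence_ms in product(thresholds, min_silences_ms):
        segments = get_speech_timestamps(
            wav, model,
            sampling_rate=sampling_rate,
            threshold=threshold,                     # speech probability cutoff
            min_silence_duration_ms=min_silence_ms,  # assumed keyword name
        )
        # Segments are dicts with 'start' and 'end' given in samples.
        speech_samples = sum(seg['end'] - seg['start'] for seg in segments)
        results.append({
            'threshold': threshold,
            'min_silence_ms': min_silence_ms,
            'speech_seconds': speech_samples / sampling_rate,
        })
    return results
```

Comparing `speech_seconds` against the amount of speech in your reference annotation gives a quick indication of which settings reduce missed speech before you run a full DER evaluation.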
The new VAD version was released just now - https://github.com/snakers4/silero-vad/issues/2#issuecomment-2195433115.
It is now trained on more than 6,000 languages.
Can you please test it on your data again?
If the issue persists, please open a new issue referencing this one.
Many thanks!
I have tried it on two Chinese speaker diarization datasets, but the results are not good; in particular, a lot of human speech is discarded as noise. Can the model be fine-tuned?
The code I used:

    from pprint import pprint
    import torch

    # Assumed value: 16 kHz, as in the silero-vad example notebook (not shown in the original post)
    SAMPLING_RATE = 16000

    USE_ONNX = False  # change this to True if you want to test onnx model
    if USE_ONNX:
        !pip install -q onnxruntime  # notebook shell command

    model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                  model='silero_vad',
                                  force_reload=True,
                                  onnx=USE_ONNX)

    (get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

    wav = read_audio('S_R004S04C01.wav', sampling_rate=SAMPLING_RATE)

    # get speech timestamps from full audio file
    speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)
    pprint(speech_timestamps)

    # using VADIterator class
    vad_iterator = VADIterator(model)
    wav = read_audio('S_R004S04C01.wav', sampling_rate=SAMPLING_RATE)

    window_size_samples = 1536  # number of samples in a single audio chunk
    for i in range(0, len(wav), window_size_samples):
        chunk = wav[i: i + window_size_samples]
        if len(chunk) < window_size_samples:
            break
        speech_dict = vad_iterator(chunk, return_seconds=True)
        if speech_dict:
            print(speech_dict, end=' ')
    vad_iterator.reset_states()  # reset model states after each audio
The result on Alimeeting-Test:

    MS: 20.299598, FA: 1.372215, SER: 1.088590, DER: 22.760403
    MS: 31.277793, FA: 2.150170, SER: 1.933873, DER: 35.361836
    MS: 31.944428, FA: 0.511342, SER: 2.276318, DER: 34.732088
    MS: 47.038586, FA: 0.163343, SER: 9.470302, DER: 56.672231
    MS: 74.286394, FA: 0.007934, SER: 3.434961, DER: 77.729289
    MS: 30.688677, FA: 0.704153, SER: 2.770183, DER: 34.163013
    MS: 59.316559, FA: 0.324209, SER: 8.123554, DER: 67.764322
    MS: 98.369565, FA: 0.000000, SER: 0.562652, DER: 98.932217
    MS: 99.417771, FA: 0.000000, SER: 0.058597, DER: 99.476368
    MS: 99.910412, FA: 0.000000, SER: 0.000000, DER: 99.910412
    MS: 99.493029, FA: 0.000000, SER: 0.120111, DER: 99.613140
    MS: 61.856814, FA: 0.623673, SER: 0.184956, DER: 62.665443
    MS: 19.090301, FA: 4.226608, SER: 3.039757, DER: 26.356666
    MS: 33.685372, FA: 0.338829, SER: 0.267496, DER: 34.291696
    MS: 15.374482, FA: 4.018866, SER: 0.518013, DER: 19.911360
    MS: 42.467802, FA: 1.968425, SER: 0.268384, DER: 44.704612
    MS: 17.370355, FA: 0.626849, SER: 0.326430, DER: 18.323634
    MS: 67.082939, FA: 0.626243, SER: 0.180605, DER: 67.889787
    MS: 72.216975, FA: 0.557994, SER: 0.130966, DER: 72.905935
    MS: 14.936698, FA: 1.236910, SER: 0.225926, DER: 16.399534
The result on Aishell-4:

    MS: 79.665430, FA: 0.012366, SER: 5.601830, DER: 85.279626
    MS: 67.227370, FA: 0.132288, SER: 1.020209, DER: 68.379866
    MS: 61.530820, FA: 18.205761, SER: 5.297353, DER: 85.033934
    MS: 54.602609, FA: 0.152443, SER: 2.483539, DER: 57.238590
    MS: 67.082935, FA: 0.078205, SER: 2.599719, DER: 69.760859
    MS: 51.416720, FA: 0.204723, SER: 1.379586, DER: 53.001029
    MS: 56.959476, FA: 0.203365, SER: 7.326404, DER: 64.489246
    MS: 36.057926, FA: 0.157853, SER: 1.157691, DER: 37.373470
    MS: 79.330646, FA: 0.097513, SER: 0.407194, DER: 79.835354
    MS: 81.295235, FA: 0.062895, SER: 1.192822, DER: 82.550952
    MS: 60.887943, FA: 0.599634, SER: 2.776542, DER: 64.264119
    MS: 70.418660, FA: 0.084877, SER: 3.336644, DER: 73.840181
    MS: 11.451400, FA: 0.658543, SER: 3.846325, DER: 15.956268
    MS: 21.339103, FA: 0.351577, SER: 0.758447, DER: 22.449127
    MS: 22.068026, FA: 0.588110, SER: 6.252810, DER: 28.908947
    MS: 21.507885, FA: 0.162660, SER: 1.766586, DER: 23.437131
    MS: 28.836928, FA: 0.203312, SER: 0.167732, DER: 29.207972
    MS: 18.727860, FA: 0.238973, SER: 1.228832, DER: 20.195666
    MS: 17.108661, FA: 0.269604, SER: 0.083678, DER: 17.461943
    MS: 13.953794, FA: 0.308104, SER: 1.880523, DER: 16.142421
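For reading these numbers: in each row above, DER is the sum MS + FA + SER (presumably missed speech, false alarm, and speaker error, in percent), so the share of DER that comes from MS shows how much of the error is speech the VAD dropped. A small sketch that parses rows in this format and checks that decomposition; the helper names `parse_der_rows` and `missed_speech_share` are hypothetical, and the two sample rows are copied from the Alimeeting-Test results.

```python
import re

# Matches one "MS: x, FA: x, SER: x, DER: x" record in the format used above.
ROW_RE = re.compile(
    r"MS:\s*([\d.]+),\s*FA:\s*([\d.]+),\s*SER:\s*([\d.]+),\s*DER:\s*([\d.]+)"
)


def parse_der_rows(text):
    """Return a list of dicts with float MS, FA, SER and DER values."""
    rows = []
    for ms, fa, ser, der in ROW_RE.findall(text):
        rows.append({"MS": float(ms), "FA": float(fa),
                     "SER": float(ser), "DER": float(der)})
    return rows


def missed_speech_share(row):
    """Fraction of the total DER that is due to missed speech."""
    return row["MS"] / row["DER"] if row["DER"] else 0.0


sample = ("MS: 20.299598, FA: 1.372215, SER: 1.088590, DER: 22.760403 "
          "MS: 98.369565, FA: 0.000000, SER: 0.562652, DER: 98.932217")

rows = parse_der_rows(sample)
for row in rows:
    total = row["MS"] + row["FA"] + row["SER"]
    print(f"DER={row['DER']:.6f}  MS+FA+SER={total:.6f}  "
          f"MS share={missed_speech_share(row):.2%}")
```

In both sample rows, missed speech accounts for most of the DER, which is consistent with the VAD discarding speech rather than letting noise through.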