Coconut059 commented 2 months ago

I tried two Chinese speaker diarization datasets, but the results are poor; in particular, human speech is often discarded as noise. Can the model be fine-tuned?

The code I used:

import torch
from pprint import pprint

SAMPLING_RATE = 16000  # assumed value; the definition was not included in the post

USE_ONNX = False  # change this to True if you want to test the onnx model
if USE_ONNX:
    !pip install -q onnxruntime

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True,
                              onnx=USE_ONNX)

(get_speech_timestamps,
 save_audio,
 read_audio,
 VADIterator,
 collect_chunks) = utils

wav = read_audio('S_R004S04C01.wav', sampling_rate=SAMPLING_RATE)

# get speech timestamps from full audio file
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)
pprint(speech_timestamps)

# using VADIterator class
vad_iterator = VADIterator(model)
wav = read_audio('S_R004S04C01.wav', sampling_rate=SAMPLING_RATE)

window_size_samples = 1536  # number of samples in a single audio chunk
for i in range(0, len(wav), window_size_samples):
    chunk = wav[i: i + window_size_samples]
    if len(chunk) < window_size_samples:
        break  # skip the trailing partial chunk
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:
        print(speech_dict, end=' ')
vad_iterator.reset_states()  # reset model states after each audio

The results on Alimeeting-Test (one line per recording):

MS: 20.299598, FA: 1.372215, SER: 1.088590, DER: 22.760403
MS: 31.277793, FA: 2.150170, SER: 1.933873, DER: 35.361836
MS: 31.944428, FA: 0.511342, SER: 2.276318, DER: 34.732088
MS: 47.038586, FA: 0.163343, SER: 9.470302, DER: 56.672231
MS: 74.286394, FA: 0.007934, SER: 3.434961, DER: 77.729289
MS: 30.688677, FA: 0.704153, SER: 2.770183, DER: 34.163013
MS: 59.316559, FA: 0.324209, SER: 8.123554, DER: 67.764322
MS: 98.369565, FA: 0.000000, SER: 0.562652, DER: 98.932217
MS: 99.417771, FA: 0.000000, SER: 0.058597, DER: 99.476368
MS: 99.910412, FA: 0.000000, SER: 0.000000, DER: 99.910412
MS: 99.493029, FA: 0.000000, SER: 0.120111, DER: 99.613140
MS: 61.856814, FA: 0.623673, SER: 0.184956, DER: 62.665443
MS: 19.090301, FA: 4.226608, SER: 3.039757, DER: 26.356666
MS: 33.685372, FA: 0.338829, SER: 0.267496, DER: 34.291696
MS: 15.374482, FA: 4.018866, SER: 0.518013, DER: 19.911360
MS: 42.467802, FA: 1.968425, SER: 0.268384, DER: 44.704612
MS: 17.370355, FA: 0.626849, SER: 0.326430, DER: 18.323634
MS: 67.082939, FA: 0.626243, SER: 0.180605, DER: 67.889787
MS: 72.216975, FA: 0.557994, SER: 0.130966, DER: 72.905935
MS: 14.936698, FA: 1.236910, SER: 0.225926, DER: 16.399534

The results on Aishell-4 (one line per recording):

MS: 79.665430, FA: 0.012366, SER: 5.601830, DER: 85.279626
MS: 67.227370, FA: 0.132288, SER: 1.020209, DER: 68.379866
MS: 61.530820, FA: 18.205761, SER: 5.297353, DER: 85.033934
MS: 54.602609, FA: 0.152443, SER: 2.483539, DER: 57.238590
MS: 67.082935, FA: 0.078205, SER: 2.599719, DER: 69.760859
MS: 51.416720, FA: 0.204723, SER: 1.379586, DER: 53.001029
MS: 56.959476, FA: 0.203365, SER: 7.326404, DER: 64.489246
MS: 36.057926, FA: 0.157853, SER: 1.157691, DER: 37.373470
MS: 79.330646, FA: 0.097513, SER: 0.407194, DER: 79.835354
MS: 81.295235, FA: 0.062895, SER: 1.192822, DER: 82.550952
MS: 60.887943, FA: 0.599634, SER: 2.776542, DER: 64.264119
MS: 70.418660, FA: 0.084877, SER: 3.336644, DER: 73.840181
MS: 11.451400, FA: 0.658543, SER: 3.846325, DER: 15.956268
MS: 21.339103, FA: 0.351577, SER: 0.758447, DER: 22.449127
MS: 22.068026, FA: 0.588110, SER: 6.252810, DER: 28.908947
MS: 21.507885, FA: 0.162660, SER: 1.766586, DER: 23.437131
MS: 28.836928, FA: 0.203312, SER: 0.167732, DER: 29.207972
MS: 18.727860, FA: 0.238973, SER: 1.228832, DER: 20.195666
MS: 17.108661, FA: 0.269604, SER: 0.083678, DER: 17.461943
MS: 13.953794, FA: 0.308104, SER: 1.880523, DER: 16.142421
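The abbreviations are not expanded in the thread; presumably MS is missed speech, FA is false alarm, SER is speaker error, and DER is the diarization error rate, all in percent. Under that reading each DER should be the sum of the other three columns, and the listed numbers are consistent with it. The sketch below parses two of the Alimeeting-Test records, checks the sum, and reports which component dominates. The helper names `parse_rows` and `dominant_error` are made up for illustration.

```python
import re

# Two records copied from the Alimeeting-Test results in this thread
# (assumed to be percentages).
ROWS = """
MS: 20.299598, FA: 1.372215, SER: 1.088590, DER: 22.760403
MS: 98.369565, FA: 0.000000, SER: 0.562652, DER: 98.932217
"""

PATTERN = re.compile(r"MS: ([\d.]+), FA: ([\d.]+), SER: ([\d.]+), DER: ([\d.]+)")


def parse_rows(text):
    """Return one dict per 'MS: ..., FA: ..., SER: ..., DER: ...' record."""
    keys = ("MS", "FA", "SER", "DER")
    return [dict(zip(keys, map(float, m.groups()))) for m in PATTERN.finditer(text)]


def dominant_error(row):
    """Name the component (MS, FA or SER) that contributes most to DER."""
    return max(("MS", "FA", "SER"), key=lambda k: row[k])


rows = parse_rows(ROWS)
for row in rows:
    total = row["MS"] + row["FA"] + row["SER"]
    # The reported DER should match the sum of its components up to rounding.
    assert abs(total - row["DER"]) < 1e-4
    print(f"DER {row['DER']:.2f}% is dominated by {dominant_error(row)}")
```

In these records almost all of the error is MS, which points at the VAD dropping speech rather than at speaker confusion. Note that averaging per-recording DER values is not the same as a corpus-level DER, which weights each recording by its speech duration.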

adamnsandle commented 2 months ago

Thanks for your comment! We will add these datasets to our validation for more stable future models.

Coconut059 commented 2 months ago

Hi! Can you tell me why the voice activity detection module performs so poorly? Does its performance depend heavily on the dataset?

yuGAN6 commented 1 month ago

> Hi! Can you tell me why the voice activity detection module performs so poorly? Does its performance depend heavily on the dataset?

Tuning the parameters for your dataset is necessary. If your audio is quiet overall, try a lower threshold and a longer min_silence_samples; otherwise, try a higher threshold and a shorter min_silence_samples.
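To see why those two knobs matter, here is a simplified sketch of threshold-plus-minimum-silence post-processing on per-frame speech probabilities. It is not Silero's implementation: the function `segments_from_probs`, the frame-based `min_silence_frames`, and the probability values are all made up for illustration. In the actual library the corresponding arguments of `get_speech_timestamps` are `threshold` and, in recent releases as far as I know, `min_silence_duration_ms`; check the signature of the version you have installed.

```python
def segments_from_probs(probs, threshold=0.5, min_silence_frames=3):
    """Turn per-frame speech probabilities into (start, end) frame ranges.

    A segment opens at the first frame with prob >= threshold and closes only
    after `min_silence_frames` consecutive frames fall below the threshold.
    `end` is exclusive.
    """
    segments = []
    start = None      # first frame of the currently open segment
    silence_run = 0   # consecutive sub-threshold frames inside the open segment
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # Close the segment just past its last speech frame.
                segments.append((start, i - silence_run + 1))
                start, silence_run = None, 0
    if start is not None:
        # Audio ended while a segment was still open.
        segments.append((start, len(probs) - silence_run))
    return segments


# Hypothetical probabilities for quiet, far-field speech with a short pause.
probs = [0.1, 0.4, 0.45, 0.2, 0.2, 0.42, 0.4, 0.1, 0.1, 0.1, 0.1]

default = segments_from_probs(probs, threshold=0.5, min_silence_frames=3)
tuned = segments_from_probs(probs, threshold=0.35, min_silence_frames=3)
split = segments_from_probs(probs, threshold=0.35, min_silence_frames=2)
print(default, tuned, split)
```

With the default threshold nothing is detected, which would show up as missed speech. Lowering the threshold recovers the utterance, and the longer minimum silence keeps the short pause from splitting it into two segments.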

snakers4 commented 1 week ago

A new VAD version has just been released: https://github.com/snakers4/silero-vad/issues/2#issuecomment-2195433115.

It is now trained on more than 6,000 languages.

Could you please test it on your data again?

If the issue persists, please open a new issue referencing this one.

Many thanks!

snakers4 / silero-vad

This vad algorithm does not work well on Chinese data sets #449
