snakers4 / silero-vad

Silero VAD: pre-trained enterprise-grade Voice Activity Detector
MIT License

[EDGE CASE] Cartoon voices: worse performance on v5 than on the older version #563

Open George0828Zhang opened 1 month ago

George0828Zhang commented 1 month ago

🐛 Bug

V5 fails to detect most of the cartoon-voice speech that the older version picks up.

To Reproduce

Steps to reproduce the behavior:

  1. Open the Colab example.
  2. Download this example and run up to this cell (changing 'en_example.wav' to 'ja_example.wav'):
    wav = read_audio('ja_example.wav', sampling_rate=SAMPLING_RATE)
    # get speech timestamps from full audio file
    speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)
    pprint(speech_timestamps)
  3. The result is:
    [{'end': 30464, 'start': 12032}]

    With the old version (see https://github.com/SYSTRAN/faster-whisper/issues/934#issuecomment-2439340290), the result is:

    [{'end': 40192, 'start': 12032},
    {'end': 179456, 'start': 76544},
    {'end': 379136, 'start': 273152},
    {'end': 457984, 'start': 422656},
    {'end': 630016, 'start': 576256},
    {'end': 669952, 'start': 653056},
    {'end': 863488, 'start': 695040},
    {'end': 950528, 'start': 896768}]
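To make the gap between the two results easier to read, here is a minimal sketch that converts the sample indices above into seconds and sums the detected speech. It assumes `SAMPLING_RATE` is 16000 Hz, which the report does not state; the helper names `to_seconds` and `total_speech_seconds` are hypothetical and not part of silero-vad.

```python
# Assumption: the Colab example uses a 16 kHz sampling rate (not stated in the report).
SAMPLING_RATE = 16000

# Timestamps copied verbatim from the two results above (units: samples).
v5_result = [{'end': 30464, 'start': 12032}]
old_result = [
    {'end': 40192, 'start': 12032},
    {'end': 179456, 'start': 76544},
    {'end': 379136, 'start': 273152},
    {'end': 457984, 'start': 422656},
    {'end': 630016, 'start': 576256},
    {'end': 669952, 'start': 653056},
    {'end': 863488, 'start': 695040},
    {'end': 950528, 'start': 896768},
]


def to_seconds(timestamps, sampling_rate=SAMPLING_RATE):
    """Hypothetical helper: convert sample indices to (start, end) pairs in seconds."""
    return [(ts['start'] / sampling_rate, ts['end'] / sampling_rate)
            for ts in timestamps]


def total_speech_seconds(timestamps, sampling_rate=SAMPLING_RATE):
    """Hypothetical helper: total duration of all detected segments, in seconds."""
    return sum(ts['end'] - ts['start'] for ts in timestamps) / sampling_rate


if __name__ == "__main__":
    for name, result in (("v5", v5_result), ("old", old_result)):
        print(f"{name}: {len(result)} segment(s), "
              f"{total_speech_seconds(result):.3f} s of speech")
        for start, end in to_seconds(result):
            print(f"    {start:7.3f} s -> {end:7.3f} s")
```

Under that 16 kHz assumption, the v5 result covers about 1.15 s of speech in a single segment, while the old version reports about 35.3 s across eight segments.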

Expected behavior

V5 should be better than the older version.

Environment

Please copy and paste the output from this environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Additional context

snakers4 commented 1 month ago

Can you send an audio sample?

George0828Zhang commented 1 month ago

Here: https://drive.google.com/file/d/1NPvEybP0VU1dFmd6neH6JJRW_Qm2MXdk/view

Thanks for looking into this!