[EDGE CASE] Cartoon voice worse performance on v5 than older version

George0828Zhang commented 1 month ago

🐛 Bug

V5 misses most cartoon-voice speech: it detects a single segment where the older version detects eight.

To Reproduce

Steps to reproduce the behavior:

Use the colab example.

Download this example and run the notebook up to the following cell (with 'en_example.wav' changed to 'ja_example.wav'):

wav = read_audio('ja_example.wav', sampling_rate=SAMPLING_RATE)
# get speech timestamps from full audio file
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)
pprint(speech_timestamps)

The result is:

[{'end': 30464, 'start': 12032}]

whereas with the older version (see https://github.com/SYSTRAN/faster-whisper/issues/934#issuecomment-2439340290), the result is:

[{'end': 40192, 'start': 12032},
 {'end': 179456, 'start': 76544},
 {'end': 379136, 'start': 273152},
 {'end': 457984, 'start': 422656},
 {'end': 630016, 'start': 576256},
 {'end': 669952, 'start': 653056},
 {'end': 863488, 'start': 695040},
 {'end': 950528, 'start': 896768}]
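
To put the gap in numbers, the two outputs above can be compared by total detected speech. This is only a rough sketch: it assumes SAMPLING_RATE is 16000 (the value is not stated in this issue), and `total_speech_seconds` is a hypothetical helper, not part of silero-vad.

```python
# Timestamps copied from the two outputs above (units: samples).
v5_result = [{'end': 30464, 'start': 12032}]
old_result = [
    {'end': 40192, 'start': 12032},
    {'end': 179456, 'start': 76544},
    {'end': 379136, 'start': 273152},
    {'end': 457984, 'start': 422656},
    {'end': 630016, 'start': 576256},
    {'end': 669952, 'start': 653056},
    {'end': 863488, 'start': 695040},
    {'end': 950528, 'start': 896768},
]

# Assumption: the colab's SAMPLING_RATE is 16000 Hz; adjust if it differs.
ASSUMED_SAMPLING_RATE = 16000


def total_speech_seconds(timestamps, sampling_rate=ASSUMED_SAMPLING_RATE):
    """Sum the length of all detected segments and convert samples to seconds."""
    total_samples = sum(seg['end'] - seg['start'] for seg in timestamps)
    return total_samples / sampling_rate


print(f"v5:  {len(v5_result)} segment(s), {total_speech_seconds(v5_result):.3f} s")
print(f"old: {len(old_result)} segment(s), {total_speech_seconds(old_result):.3f} s")
```

Under that assumption, v5 reports about 1.15 s of speech in one segment, while the older version reports about 35.33 s across eight segments.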

Expected behavior

V5 should perform better than the older version.

snakers4 commented 1 month ago

Can you send an audio sample?

George0828Zhang commented 1 month ago

Here: https://drive.google.com/file/d/1NPvEybP0VU1dFmd6neH6JJRW_Qm2MXdk/view

Thanks for looking into this!

snakers4 / silero-vad