pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

Warnings in Diarization Pipeline about window size mismatch #542

Closed vadimkantorov closed 3 years ago

vadimkantorov commented 3 years ago

It looks like some torch.hub configs may not match the torch.hub pretrained weights: `UserWarning: Model was trained with 4s chunks and is applied on 2s chunks. This might lead to sub-optimal results.` The full log is below:

```shell
wget http://groups.inf.ed.ac.uk/ami/AMICorpusMirror/amicorpus/ES2004a/audio/ES2004a.Mix-Headset.wav
CUDA_VISIBLE_DEVICES=3 python3 bug.py
```

`bug.py`:

```python
import soundfile
import torch

audio_path = 'ES2004a.Mix-Headset.wav'
signal, sample_rate = soundfile.read(audio_path, dtype='float32', always_2d=True)
assert sample_rate == 16_000

pipeline = torch.hub.load('pyannote/pyannote-audio', 'dia')

res = pipeline(dict(waveform=signal))
```
Output:

```
Downloading: "https://github.com/pyannote/pyannote-audio/archive/master.zip" to /root/.cache/torch/hub/master.zip
Using cache found in /root/.cache/torch/hub/pyannote_pyannote-audio_master
Using cache found in /root/.cache/torch/hub/pyannote_pyannote-audio_master
Using cache found in /root/.cache/torch/hub/pyannote_pyannote-audio_master
/miniconda/lib/python3.8/site-packages/pyannote/audio/embedding/approaches/arcface_loss.py:170: FutureWarning: The 's' parameter is deprecated in favor of 'scale', and will be removed in a future release
  warnings.warn(msg, FutureWarning)
/miniconda/lib/python3.8/site-packages/pyannote/audio/features/pretrained.py:156: UserWarning: Model was trained with 4s chunks and is applied on 2s chunks. This might lead to sub-optimal results.
  warnings.warn(msg)
Using cache found in /root/.cache/torch/hub/pyannote_pyannote-audio_master
/miniconda/lib/python3.8/site-packages/sklearn/cluster/_affinity_propagation.py:146: FutureWarning: 'random_state' has been introduced in 0.23. It will be set to None starting from 0.25 which means that results will differ at every function call. Set 'random_state' to None to silence this warning, or to 0 to keep the behavior of versions <0.23.
  warnings.warn(("'random_state' has been introduced in 0.23. "
```
hbredin commented 3 years ago

Thanks for reporting this issue.

This message is, in fact, misleading because the model was trained with variable length chunks (between min_duration=1 second and duration=4 seconds). So, using a 2s window is fine.

FYI, v2 (a complete rewrite) is being prepared in the develop branch, so I most likely won't fix this issue. I'd still consider a PR against master, though.

ahri704 commented 3 years ago

I am also getting the same warning: `UserWarning: Model was trained with 4s chunks and is applied on 2s chunks. This might lead to sub-optimal results.`

However, I think this warning points to a real effect on pyannote-audio diarization.

When I record myself repeating hello for 2s and then a gap and then goodbye for 2s, pyannote reports two speakers. When I record myself repeating hello for 4s and then a gap and then goodbye for 4s, pyannote reports one speaker.

System config: Ubuntu 20.04 LTS (on Windows 10 via WSL 1), Python 3.8.5

pyannote-audio setup using the following commands:

```shell
pip install torch
pip install pyannote.audio==1.1.1
```


```python
import torch

pipeline = torch.hub.load('pyannote/pyannote-audio', 'dia')

diarization = pipeline({'audio': './hello-goodbye-c1-r16.wav'})

with open('./hello-goodbye.rttm', 'w') as f:
    diarization.write_rttm(f)

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f'Speaker "{speaker}" speaks between t={turn.start:.1f}s and t={turn.end:.1f}s.')
```


For short hello-goodbye-c1-r16.wav, output is

```
Speaker "A" speaks between t=1.4s and t=1.7s.
Speaker "B" speaks between t=3.4s and t=3.9s.
```

For long hello-goodbye-c1-r16.wav, output is

```
/home/user/.local/lib/python3.8/site-packages/sklearn/cluster/_affinity_propagation.py:136: UserWarning: All samples have mutually equal similarities. Returning arbitrary cluster center(s).
  warnings.warn("All samples have mutually equal similarities. "
Speaker "A" speaks between t=1.0s and t=4.1s.
Speaker "A" speaks between t=4.3s and t=4.9s.
Speaker "A" speaks between t=7.0s and t=10.4s.
```

How can I get speaker diarization working with short utterances? Thanks.
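One possible workaround (an untested sketch, not an official pyannote recipe) is to tile each short recording so it spans at least the 4 s the model was trained on, assuming 16 kHz mono float32 audio as elsewhere in this thread:

```python
import numpy as np

def tile_to_min_duration(signal: np.ndarray, sample_rate: int,
                         min_seconds: float = 4.0) -> np.ndarray:
    """Repeat a short (n_samples, n_channels) signal until it spans
    at least `min_seconds`, then trim to exactly that length."""
    min_samples = int(min_seconds * sample_rate)
    if len(signal) >= min_samples:
        return signal
    repeats = -(-min_samples // len(signal))  # ceiling division
    tiled = np.tile(signal, (repeats, 1))
    return tiled[:min_samples]

# Example: a 2 s, single-channel signal grows to 4 s at 16 kHz.
short = np.zeros((32_000, 1), dtype=np.float32)
long_enough = tile_to_min_duration(short, 16_000)
print(long_enough.shape)  # (64000, 1)
```

The tiled array could then be written back to a wav file and fed to the pipeline; whether repetition yields usable speaker embeddings would need to be verified empirically.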

ahri704 commented 3 years ago

After further testing, it looks like pyannote-audio needs at least 4s of continuous speech for proper diarization.

If I create an audio file with many short utterances (around 1 s each) and one long utterance of around 4 s, the diarization correctly identifies a single speaker. It does not matter whether the long utterance is at the front, in the middle, or at the end of the audio file; there just needs to be one long utterance.

However, if I test with an audio file containing many short utterances (around 1s) and no long utterance, the diarization shows a different speaker for each utterance.
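To check which case a given recording falls into, the longest single turn can be read back from the RTTM output (a sketch assuming the standard RTTM SPEAKER line layout, in which the fourth whitespace-separated field is the turn onset and the fifth its duration, both in seconds):

```python
def max_turn_duration(rttm_path: str) -> float:
    """Return the longest single speech turn, in seconds, from an RTTM file.
    RTTM SPEAKER lines carry the onset in field 4 and the duration in field 5
    (1-based), i.e. indices 3 and 4 after splitting on whitespace."""
    longest = 0.0
    with open(rttm_path) as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == "SPEAKER":
                longest = max(longest, float(fields[4]))
    return longest
```

Applied to `hello-goodbye.rttm` above, a result below 4.0 would indicate the file contains no turn long enough for stable clustering.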

Thanks.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.