pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License
6.31k stars, 778 forks

[Unexpected Performance Drop] Using 44.1K sample_rate vs. default 16K leads to better performance in `pyannote/speaker-diarization-3.1` #1755

Open ai-nikolai opened 2 months ago

ai-nikolai commented 2 months ago

Tested versions

Reproducible in 3.1, 3.3

System information

M2 Pro, 3.3

Issue description

Sample rate mismatch:


Question:

@hbredin

Minimal reproduction example (MRE)

N/A

ai-nikolai commented 2 months ago

@hbredin

ai-nikolai commented 2 months ago

Thank you for responding, @hbredin. I will try to add a minimal reproducible script in the coming days. In the meantime, however, I have a quick question.

qalabeabbas49 commented 2 months ago

As far as I know, pyannote converts any input audio to mono 16 kHz. In my experience, audio files recorded at a higher sample rate (44.1 kHz) generally perform better because they retain more information even after downsampling to 16 kHz, whereas a file recorded at 16 kHz has less information to begin with.
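To make the "mono 16 kHz" conversion concrete, here is a minimal pure-Python sketch of the two steps: averaging the channels, then resampling. It is an illustration only. The helper names `downmix_to_mono` and `resample_linear` are hypothetical, and plain linear interpolation is an assumption chosen for brevity; it is not the resampler pyannote uses internally. Different resamplers (and their anti-aliasing filters) can produce different 16 kHz samples from the same source.

```python
from typing import List


def downmix_to_mono(channels: List[List[float]]) -> List[float]:
    """Average equally long channels sample by sample."""
    n_channels = len(channels)
    return [sum(frame) / n_channels for frame in zip(*channels)]


def resample_linear(samples: List[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample by linear interpolation (illustrative; no anti-aliasing filter)."""
    if not samples:
        return []
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate            # fractional index into the source
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of synthetic stereo audio at 44.1 kHz (hypothetical test signal).
stereo = [[0.0] * 44100, [1.0] * 44100]
mono = downmix_to_mono(stereo)
mono_16k = resample_linear(mono, 44100, 16000)
print(len(mono), len(mono_16k))
```

A production resampler would low-pass filter before decimating; this sketch skips that step, which is one reason two conversion paths to 16 kHz need not give identical waveforms.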

ai-nikolai commented 2 months ago

Thank you, qalabeabbas49. What I find interesting is that the underlying audio file is the same in both cases, whether it was originally 44.1K or originally 16K. The original file gets loaded at either 44.1K or 16K, and then pyannote converts it to 16K (as you said). It is loading the file at 44.1K that makes the difference, not whether the file was originally recorded at 44.1K.

(Loaded via `ffmpeg -ar 16000` or `ffmpeg -ar 44100`.)
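The thread only gives the `-ar` values, not the full command lines, so the following is a hedged sketch of how the two loading variants might be produced. The helper `build_ffmpeg_cmd` and the file names are hypothetical; `-i`, `-ar`, and `-y` are standard ffmpeg options. The sketch only builds the argument lists and does not invoke ffmpeg.

```python
from typing import List


def build_ffmpeg_cmd(src: str, dst: str, sample_rate: int) -> List[str]:
    """Build an ffmpeg argument list that re-encodes src at the given sample rate."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", src,                 # input file
        "-ar", str(sample_rate),   # output audio sample rate in Hz
        dst,
    ]


# Hypothetical file names; the original post does not name the files.
cmd_16k = build_ffmpeg_cmd("input.wav", "input_16k.wav", 16000)
cmd_44k = build_ffmpeg_cmd("input.wav", "input_44k.wav", 44100)
print(" ".join(cmd_16k))
print(" ".join(cmd_44k))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed, and both outputs can then be fed to the same pipeline to compare the diarization results.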

lockmatrix commented 2 weeks ago

Same for me: 44.1K performs better.