[Unexpected Performance Drop] Using 44.1K sample_rate vs. default 16K leads to better performance in `pyannote/speaker-diarization-3.1`

pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

http://pyannote.github.io

MIT License

[Unexpected Performance Drop] Using 44.1K sample_rate vs. default 16K leads to better performance in `pyannote/speaker-diarization-3.1` #1755

Open ai-nikolai opened 2 months ago

ai-nikolai commented 2 months ago

Tested versions

Reproducible in 3.1, 3.3

System information

M2 Pro, 3.3

Issue description

Sample rate mismatch:

Using the pre-trained pipeline pyannote/speaker-diarization-3.1, which has a default sample rate of 16K, we get vastly different segmentation results on simple audio files (e.g. from YouTube) depending on the sample rate of the input, with 44.1K performing better.

Question:

What are the default sample rates based on?
How does the up/down-sampling work concretely (i.e. what could be the reason for such different behaviour)? How does the pipeline work if one passes audio at a 44.1K sample rate?
Would this depend on the original encoding quality?
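To keep the file's original sample rate separate from the rate it is later loaded at, it may help to check what the file itself contains first. The following is only a sketch, assuming `ffprobe` (shipped with ffmpeg) is on the PATH; the function names are hypothetical and not part of pyannote.

```python
import json
import subprocess


def ffprobe_command(path):
    """Build an ffprobe command that reports the first audio stream as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a:0",                        # first audio stream only
        "-show_entries", "stream=sample_rate,channels",  # fields to report
        "-of", "json",
        path,
    ]


def parse_stream_info(ffprobe_json):
    """Return (sample_rate, channels) from ffprobe's JSON output."""
    stream = json.loads(ffprobe_json)["streams"][0]
    return int(stream["sample_rate"]), int(stream["channels"])


def native_format(path):
    """Run ffprobe on `path` and return its native (sample_rate, channels)."""
    result = subprocess.run(
        ffprobe_command(path), capture_output=True, text=True, check=True
    )
    return parse_stream_info(result.stdout)
```

Calling `native_format("audio.wav")` (a placeholder path) before any resampling shows whether the source was encoded at 16K, 44.1K, or something else.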

@hbredin

Minimal reproduction example (MRE)

N/A

ai-nikolai commented 2 months ago

@hbredin

ai-nikolai commented 2 months ago

Thank you for responding, @hbredin. I will try to add a minimal reproducible script in the coming days. In the meantime, I have a quick question.

Basically, what could be the reason for such a big difference between loading audio files at 44.1K vs. 16K and passing them as waveforms to pyannote/speaker-diarization-3.1?
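A comparison along these lines could look roughly like the sketch below. It is not the script from this issue: it assumes pyannote.audio 3.x (where the pipeline returns an annotation with `itertracks`), assumes `torchaudio` for loading and resampling, and uses hypothetical helper names. `"HF_TOKEN"` and `"audio.wav"` are placeholders.

```python
def diarize_at_rate(pipeline, path, load_rate):
    """Load `path`, resample it to `load_rate`, and diarize the in-memory waveform."""
    import torchaudio  # imported lazily so the pure helper below needs no extra packages

    waveform, native_rate = torchaudio.load(path)  # tensor of shape (channel, time)
    waveform = torchaudio.functional.resample(waveform, native_rate, load_rate)
    diarization = pipeline({"waveform": waveform, "sample_rate": load_rate})
    return [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]


def speech_per_speaker(segments):
    """Sum the duration of (start, end, speaker) segments for each speaker."""
    totals = {}
    for start, end, speaker in segments:
        totals[speaker] = totals.get(speaker, 0.0) + (end - start)
    return totals


def compare_rates(pipeline, path, rates=(16000, 44100)):
    """Diarize the same file at each load rate and summarize speech per speaker."""
    return {
        rate: speech_per_speaker(diarize_at_rate(pipeline, path, rate))
        for rate in rates
    }


# Example usage (requires pyannote.audio and a Hugging Face access token):
# from pyannote.audio import Pipeline
# pipeline = Pipeline.from_pretrained(
#     "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN")
# print(compare_rates(pipeline, "audio.wav"))
```

Per-speaker totals are only a coarse summary; printing the raw segment lists side by side would show where the two runs actually diverge.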

qalabeabbas49 commented 2 months ago

As far as I know, pyannote converts any audio to mono 16 kHz. In my experience, audio files recorded at a higher sample rate (44 kHz) generally perform better because they carry more information even after downsampling to 16 kHz, while a file recorded at 16 kHz has less information.

ai-nikolai commented 2 months ago

Thank you, qalabeabbas49. What I find interesting is that the audio file is the same in both cases, whether it was originally 44.1K or originally 16K. The original file gets loaded at either 44.1K or 16K, and then pyannote converts it to 16K (as you said). What makes the difference is loading the file at 44.1K, not whether the file was originally 44.1K.

(Loaded via ffmpeg -ar 16000 or ffmpeg -ar 44100.)
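For reference, loading through ffmpeg at a chosen rate might be sketched as follows. Only the `-ar` values come from the comment above; the mono downmix (`-ac 1`), the raw float output format (`-f f32le`), and the function names are assumptions made for illustration.

```python
import array
import subprocess
import sys


def ffmpeg_decode_command(path, sample_rate):
    """Build an ffmpeg command that decodes `path` to mono float32 PCM on stdout."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-i", path,
        "-ac", "1",               # assumption: downmix to a single channel
        "-ar", str(sample_rate),  # resample to the requested rate
        "-f", "f32le",            # raw 32-bit little-endian float samples
        "-",                      # write the decoded audio to stdout
    ]


def load_samples(path, sample_rate):
    """Decode `path` with ffmpeg and return its samples as an array of floats."""
    result = subprocess.run(
        ffmpeg_decode_command(path, sample_rate), capture_output=True, check=True
    )
    samples = array.array("f")
    samples.frombytes(result.stdout)
    if sys.byteorder == "big":
        samples.byteswap()  # ffmpeg wrote little-endian data
    return samples
```

The samples from `load_samples("audio.wav", 16000)` and `load_samples("audio.wav", 44100)` (placeholder path) could then each be wrapped in a `(channel, time)` tensor and passed to the pipeline together with the matching `sample_rate`.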

lockmatrix commented 2 weeks ago

Same for me: 44.1K performs better.