pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

Outputs of separation module are clipping #1729

Open faroit opened 1 week ago

faroit commented 1 week ago

Tested versions

System information

macOS, M1

Issue description

Hi @hbredin, @joonaskalda thanks for this great release!

I tried some examples with the new PixIT pipeline, and the outputs of the separation module seem to clip heavily. Is this to be expected from the way it was trained with scale-invariant losses?

The input was a mono WAV file downsampled to 16 kHz, taken from the YouTube excerpt linked below.
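(For reference, the resampling can be done along these lines; a sketch assuming librosa and soundfile are installed, with placeholder file names.)

import librosa
import soundfile as sf

# downmix to mono and resample to 16 kHz before feeding the pipeline
y, sr = librosa.load("excerpt.wav", sr=16000, mono=True)
sf.write("audio.wav", y, sr)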

[image: waveform of a separated output, showing clipping]

Minimal reproduction example (MRE)

https://www.youtube.com/watch?v=CGUpPyA48jE&t=182s

# instantiate the pipeline
from pyannote.audio import Pipeline
import scipy.io.wavfile

pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on an audio file
diarization, sources = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)

# dump each separated source to disk as a <SPEAKER>.wav file
for s, speaker in enumerate(diarization.labels()):
    scipy.io.wavfile.write(f"{speaker}.wav", 16000, sources.data[:, s])

joonaskalda commented 1 week ago

Hi @faroit, thank you for your interest in PixIT! I suspect the issue is that the current version is trained only on the AMI meeting dataset. On the AMI test set this hasn’t been an issue. Finetuning on domain-specific audio would likely improve the separation performance.

faroit commented 1 week ago

@joonaskalda thanks for your reply. I am not sure fine-tuning would really fix this. I dug a bit deeper and saw that the maximum output after separation is about 81.0 in that example. Interestingly, the signal also drifts in terms of its bias (DC offset). Here is the peak-normalized output of speaker 1:

[image: peak-normalized output of speaker 1, with a drifting bias]

Was the model trained on zero-mean, unit-variance data?
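For reference, this is roughly how I measured the peak level and the offset (a sketch, assuming sources is the object returned by the pipeline in the MRE above, with .data of shape (num_samples, num_speakers)):

import numpy as np

# peak amplitude per separated source; anything above 1.0 clips when written as WAV
print("peaks:", np.abs(sources.data).max(axis=0))

# mean per source; a value far from zero indicates a DC offset in the signal
print("means:", sources.data.mean(axis=0))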

joonaskalda commented 1 week ago

Thanks for investigating. I checked and the separated sources are (massively) scaled up for AMI data too. I never noticed because I’ve peak-normalized them before use. The scale-invariant loss is indeed the likely culprit.

The training data was not normalized to zero mean and unit variance.
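The peak normalization I apply before use is something along these lines (a sketch; the eps guard against silent channels is illustrative):

import numpy as np

# per-source peak normalization of the separated signals
eps = 1e-8
peaks = np.abs(sources.data).max(axis=0, keepdims=True)
sources.data = sources.data / np.maximum(peaks, eps)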

faroit commented 1 week ago

@joonaskalda thanks for the update. Maybe you could add a normalization step to the pipeline, so that users who aren't familiar with SI-SDR-trained models aren't surprised.
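For example, something like this in the saving loop from the MRE (a sketch; the DC-offset removal is an extra suggestion on my part, not something the current pipeline does):

import numpy as np
import scipy.io.wavfile

for s, speaker in enumerate(diarization.labels()):
    source = sources.data[:, s].astype(np.float32)
    source -= source.mean()            # remove the DC offset / bias drift
    peak = np.abs(source).max()
    if peak > 0:
        source /= peak                 # peak-normalize into [-1, 1]
    scipy.io.wavfile.write(f"{speaker}.wav", 16000, source)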