Open faroit opened 1 week ago
Hi @faroit, thank you for your interest in PixIT! I suspect the issue is that the current version is trained only on the AMI meeting dataset. On the AMI test set this hasn’t been an issue. Finetuning on domain-specific audio would likely improve the separation performance.
@joonaskalda thanks for your reply. I am not sure if fine-tuning would really be able to fix any of this.
I dug a bit deeper and saw that the maximum output after separation is about 81.0
in that example. Also interesting: the output drifts in terms of DC bias as well. Here is the peak-normalized output of speaker 1:
Was the model trained on zero-mean, unit variance data?
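The two symptoms above (peak far above 1.0, drifting mean) are easy to check numerically. A minimal sketch, using a fabricated array in place of the pipeline's real separated source, which is an assumption for illustration only:

```python
# Sketch: inspect the scale and DC offset of a separated source.
# A real separated waveform would come from the pipeline; here we
# fabricate one that mimics the reported behaviour (gain ~81, slow drift).
import numpy as np

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                      # 1 s at 16 kHz
source = 81.0 * clean + np.linspace(0.0, 5.0, 16000)    # scaled up + DC drift

peak = np.abs(source).max()   # well above 1.0 -> will clip on export
offset = source.mean()        # non-zero -> DC bias drift

print(f"peak amplitude: {peak:.1f}, mean (DC offset): {offset:.2f}")
```

Anything with a peak above 1.0 will clip when written to a fixed-point WAV file, regardless of how good the separation itself is.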
Thanks for investigating. I checked and the separated sources are (massively) scaled up for AMI data too. I never noticed because I’ve peak-normalized them before use. The scale-invariant loss is indeed the likely culprit.
The training data was not normalized to zero mean and unit variance.
@joonaskalda thanks for the update. Maybe you could add a normalization step to the pipeline, so that users who aren't familiar with SI-SDR-trained models aren't surprised.
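The suggested post-processing could be as simple as removing the DC offset and peak-normalizing each source. A sketch, assuming the sources arrive as a `(num_samples, num_speakers)` float array (the array shape and the `peak_normalize` helper are assumptions for illustration):

```python
# Sketch of the suggested post-processing: remove DC drift and
# peak-normalize each separated source so exports don't clip.
import numpy as np

def peak_normalize(sources: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Rescale each source (column) to a peak amplitude of 1."""
    sources = sources - sources.mean(axis=0, keepdims=True)  # remove DC offset
    peaks = np.abs(sources).max(axis=0, keepdims=True)
    return sources / np.maximum(peaks, eps)                  # eps avoids /0

# Example: a source scaled far outside [-1, 1], as reported in this issue.
raw = 81.0 * np.sin(np.linspace(0, 40 * np.pi, 16000))[:, None]
normalized = peak_normalize(raw)
print(np.abs(normalized).max())  # 1.0
```

Per-source (rather than global) normalization also keeps a quiet speaker audible, though it discards relative loudness between speakers, which may or may not be desirable depending on the downstream use.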
Tested versions
System information
macOS (Apple Silicon, M1)
Issue description
Hi @hbredin, @joonaskalda thanks for this great release!
I tried some examples on the new PixIT pipeline, and the outputs of the separation module seem to exhibit a very high level of clipping. Is this to be expected given that it was trained with scale-invariant losses?
Input was a downsampled 16 kHz mono WAV file from the YouTube excerpt linked below.
Minimal reproduction example (MRE)
https://www.youtube.com/watch?v=CGUpPyA48jE&t=182s
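The preprocessing described above (downmix to mono, resample to 16 kHz) can be sketched as follows. The YouTube audio itself is not fetched here; a synthetic 44.1 kHz stereo signal stands in for the downloaded excerpt:

```python
# Sketch: downmix to mono and resample to 16 kHz, as done for the MRE.
import numpy as np
from scipy.signal import resample_poly

sr_in, sr_out = 44100, 16000
stereo = np.random.default_rng(0).standard_normal((sr_in * 2, 2))  # 2 s stereo

mono = stereo.mean(axis=1)            # downmix to mono
# 16000/44100 reduces to 160/441 (gcd = 100)
audio_16k = resample_poly(mono, 160, 441)

print(audio_16k.shape)  # (32000,) -> 2 s at 16 kHz
```

The resampled mono array is then what would be handed to the separation pipeline.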