sigsep / sigsep-mus-eval

museval - source separation evaluation tools for python
https://sigsep.github.io/sigsep-mus-eval/
MIT License
195 stars 36 forks source link

Your opinion on PEASS? #78

Open sevagh opened 3 years ago

sevagh commented 3 years ago

Hello,

I'm working on a project where I implement and compare various source separation algorithms. I am using PEASS: http://bass-db.gforge.inria.fr/peass/ as an evaluation (which is supposedly a perceptual evolution of BSS)

In a specific case, I noticed one algorithm gets higher PEASS scores across the board - artifact/APS, interference/IPS, target/TPS) than https://github.com/sigsep/open-unmix-pytorch, but lower on BSS scores across the board.

Has the sigsep community looked at PEASS? Compared it to BSS?

Thanks.

sevagh commented 3 years ago

If you have access to the MATLAB Wavelet Toolbox (for the CQT/ICQT functions: https://www.mathworks.com/help/wavelet/ref/cqt.html), I have written this algorithm for Harmonic/Percussive/Vocal source separation (loosely based on an iterative version of Fitzgerald soft-masking median filtering HPSS). It only works (for now) on mono wav files.

It obtains better PEASS scores than UMX (to compare with UMX, I use the MUSDB18-HQ pretrained pytorch model, and set bass+other = harmonic, drums = percussive, vocal = vocal). But it gets much worse BSSv4 scores (or even the original BSS).

Usage is HarmonicPercussiveVocal("path_to_mix.wav") to write harmonic, percussive, and vocal component files to cwd.

function HarmonicPercussiveVocal(filename, varargin)
p = inputParser;

WindowSizeP = 1024;
HopSizeP = 256;

Power = 2;

LHarmSTFT = 17;
LPercSTFT = 17;

LHarmCQT = 17;
LPercCQT = 7;

defaultOutDir = '.';

addRequired(p, 'filename', @ischar);
addOptional(p, 'OutDir', defaultOutDir, @ischar);

parse(p, filename, varargin{:});

[x, fs] = audioread(p.Results.filename);

%%%%%%%%%%%%%%%%%%%
% FIRST ITERATION %
%%%%%%%%%%%%%%%%%%%

% CQT of original signal
[cfs1,~,g1,fshifts1] = cqt(x, 'SamplingFrequency', fs, 'BinsPerOctave', 96);

cmag1 = abs(cfs1); % use the magnitude CQT for creating masks

H1 = movmedian(cmag1, LHarmCQT, 2);
P1 = movmedian(cmag1, LPercCQT, 1);

% soft masks, Fitzgerald 2010 - p is usually 1 or 2
Hp1 = H1 .^ Power;
Pp1 = P1 .^ Power;
total1 = Hp1 + Pp1;
Mh1 = Hp1 ./ total1;
Mp1 = Pp1 ./ total1;

% recover the complex STFT H and P from S using the masks
H1 = Mh1 .* cfs1;
P1 = Mp1 .* cfs1;

% finally istft to convert back to audio
xh1 = icqt(H1, g1, fshifts1);
xp1 = icqt(P1, g1, fshifts1);

%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SECOND ITERATION, VOCAL %
%%%%%%%%%%%%%%%%%%%%%%%%%%%

xim2 = xp1;

% CQT of original signal
[cfs2,~,g2,fshifts2] = cqt(xim2, 'SamplingFrequency', fs, 'BinsPerOctave', 24);

cmag2 = abs(cfs2); % use the magnitude CQT for creating masks

H2 = movmedian(cmag2, LHarmCQT, 2);
P2 = movmedian(cmag2, LPercCQT, 1);

% soft mask
Hp2 = H2 .^ Power;
Pp2 = P2 .^ Power;
total2 = Hp2 + Pp2;
Mh2 = Hp2 ./ total2;
Mp2 = Pp2 ./ total2;

% todo - set bins of mask below 100hz to 0

% recover the complex STFT H and P from S using the masks
H2 = Mh2 .* cfs2;
P2 = Mp2 .* cfs2;

% finally istft to convert back to audio
xh2 = icqt(H2, g2, fshifts2);
xp2 = icqt(P2, g2, fshifts2);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THIRD ITERATION, PERCUSSIVE %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

xim3 = xp1 + xp2;

% STFT parameters
winLen3 = WindowSizeP;
fftLen3 = winLen3 * 2;
overlapLen3 = HopSizeP;
win3 = sqrt(hann(winLen3, "periodic"));

% STFT of original signal
S3 = stft(xim3, "Window", win3, "OverlapLength", overlapLen3, ...
  "FFTLength", fftLen3, "Centered", true);

halfIdx3 = 1:ceil(size(S3, 1) / 2); % only half the STFT matters
Shalf3 = S3(halfIdx3, :);
Smag3 = abs(Shalf3); % use the magnitude STFT for creating masks

% median filters
H3 = movmedian(Smag3, LHarmSTFT, 2);
P3 = movmedian(Smag3, LPercSTFT, 1);

% binary masks with separation factor, Driedger et al. 2014
% soft masks, Fitzgerald 2010 - p is usually 1 or 2
Hp3 = H3 .^ Power;
Pp3 = P3 .^ Power;
total3 = Hp3 + Pp3;
Mp3 = Pp3 ./ total3;

% recover the complex STFT H and P from S using the masks
P3 = Mp3 .* Shalf3;

% we previously dropped the redundant second half of the fft
P3 = cat(1, P3, flipud(conj(P3)));

% finally istft to convert back to audio
xp3 = istft(P3, "Window", win3, "OverlapLength", overlapLen3,...
  "FFTLength", fftLen3, "ConjugateSymmetric", true);

% fix up some lengths
if size(xh1, 1) < size(x, 1)
    xh1 = [xh1; x(size(xh1, 1)+1:size(x, 1))];
end

if size(xp3, 1) < size(x, 1)
    xp3 = [xp3; x(size(xp3, 1)+1:size(x, 1))];
end

if size(xh2, 1) < size(x, 1)
    xh2 = [xh2; x(size(xh2, 1)+1:size(x, 1))];
    xp2 = [xp2; x(size(xp2, 1)+1:size(x, 1))];
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% FOURTH ITERATION, REFINE HARMONIC %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

if size(xp3, 1) < size(x, 1)
    xp3 = [xp3; x(size(xp3, 1)+1:size(x, 1))];
end

if size(xh2, 1) < size(x, 1)
    xh2 = [xh2; x(size(xh2, 1)+1:size(x, 1))];
end

% use 2nd iter vocal estimation to improve harmonic sep
x_vocal = xh2;
x_harmonic = xh1;
x_percussive = xp3;

% CQT of harmonic signal
% use a high frequency resolution here as  well
[cfs4,~,g4,fshifts4] = cqt(x_harmonic, 'SamplingFrequency', fs, 'BinsPerOctave', 12);
[cfs4_vocal,~,~,~] = cqt(x_vocal, 'SamplingFrequency', fs, 'BinsPerOctave', 12);
[cfs4_percussive,~,~,~] = cqt(x_percussive, 'SamplingFrequency', fs, 'BinsPerOctave', 12);

cmag4 = abs(cfs4); % use the magnitude CQT for creating masks
cmag4_vocal = abs(cfs4_vocal);
cmag4_percussive = abs(cfs4_percussive);

% soft masks, Fitzgerald 2010 - p is usually 1 or 2
H4 = cmag4 .^ Power;
V4 = cmag4_vocal .^ Power;
P4 = cmag4_percussive .^ Power;
total4 = H4 + V4 + P4;
Mh4 = H4 ./ total4;

H4 = Mh4 .* cfs4;

% finally istft to convert back to audio
xh4 = icqt(H4, g4, fshifts4);

[~,fname,~] = fileparts(p.Results.filename);
splt = split(fname, "_");
prefix = splt{1};

% fix up some lengths
if size(xh4, 1) < size(x, 1)
    xh4 = [xh4; x(size(xh4, 1)+1:size(x, 1))];
end

xhOut = sprintf("%s/%s_harmonic.wav", p.Results.OutDir, prefix);
xpOut = sprintf("%s/%s_percussive.wav", p.Results.OutDir, prefix);
xvOut = sprintf("%s/%s_vocal.wav", p.Results.OutDir, prefix);

audiowrite(xhOut, xh4, fs);
audiowrite(xpOut, xp3, fs);
audiowrite(xvOut, xh2, fs);
end
aliutkus commented 3 years ago

Hi Nice This actually makes me think of the work @jonathandriedger did back in the days !

Concerning PEASS, long story short: it's super slow, trained on antediluvean data and would need a serious update, but the idea is nice. It was never really adopted mostly due to its slowness

sevagh commented 3 years ago

Yes, it's inspired by his algorithm (2-pass with large window + small window), and also Fitzgerald later revisited and added a multipass version with CQT for voice separation: https://arrow.tudublin.ie/cgi/viewcontent.cgi?article=1007&context=argart