pheepa / DCUnet

Phase-aware speech enhancement with Deep Complex U-Net

PESQ value of my reproduced model (3 epochs) is only 2.0438 instead of 2.818 #6

Open Xu-Kaibo opened 1 year ago

Xu-Kaibo commented 1 year ago

The PESQ value of my reproduced model (3 epochs) is only 2.0438 instead of the 2.818 from Mr. Filippov's experiment. I followed the steps of the provided code, but the function for calculating the PESQ value could not run, so I modified it a little. The PESQ value of the noisy test set itself is 1.9306, and after denoising with the model trained for 3 epochs the PESQ is just 2.0438. I am confused about where the problem is. My PESQ calculation code is pasted below:

# tqdm, pesq, down_sample, DEVICE, N_FFT and HOP_LENGTH are defined elsewhere in my notebook
def metrics_score(mode, net, test_loader):
    # mode: "Testset" / "TestModel"
    #   "Testset"   -> calculate the metrics of the noisy samples of the test set
    #                  (pass anything you want to "net", it is ignored)
    #   "TestModel" -> evaluate the performance of the model
    # Considered metric: PESQ
    print("Measuring Metrics for", mode)
    if mode == "TestModel":
        net.eval()
    test_pesq = 0.
    counter = 0.

    for noisy_x, clean_x in tqdm(test_loader):
        # move the noisy input to the device the model lives on
        noisy_x = noisy_x.to(DEVICE)

        # reconstruct the clean waveform from its STFT
        clean_x = torch.squeeze(clean_x, 1)
        clean_x = torch.istft(clean_x, n_fft=N_FFT, hop_length=HOP_LENGTH, normalized=True)

        pesq_a = 0.
        if mode == "Testset":
            # reconstruct the noisy waveform and score it directly against the clean one
            noisy_x = torch.squeeze(noisy_x, 1)
            noisy_x = torch.istft(noisy_x, n_fft=N_FFT, hop_length=HOP_LENGTH, normalized=True)
            for i in range(len(clean_x)):  # speech may be in the form of [d, n] instead of [1, n]
                clean_x_16 = down_sample(clean_x[i, :].view(1, -1), 48000, 16000)
                noisy_x_16 = down_sample(noisy_x[i, :].view(1, -1), 48000, 16000)
                clean_x_16 = clean_x_16.cpu().numpy().flatten()
                noisy_x_16 = noisy_x_16.detach().cpu().numpy().flatten()

                pesq_a += pesq.pesq(16000, clean_x_16, noisy_x_16, 'wb')

        elif mode == "TestModel":
            # run the model and score its output against the clean waveform
            with torch.no_grad():
                pred_x = net(noisy_x)
            for i in range(len(clean_x)):
                clean_x_16 = down_sample(clean_x[i, :].view(1, -1), 48000, 16000)
                pred_x_16 = down_sample(pred_x[i, :].view(1, -1), 48000, 16000)
                # I could not get the Resample transform below to run:
                # clean_x_16 = torchaudio.transforms.Resample(48000, 16000)(clean_x[i, :].view(1, -1))
                # pred_x_16 = torchaudio.transforms.Resample(48000, 16000)(pred_x[i, :].view(1, -1))
                clean_x_16 = clean_x_16.cpu().numpy().flatten()
                pred_x_16 = pred_x_16.detach().cpu().numpy().flatten()

                pesq_a += pesq.pesq(16000, clean_x_16, pred_x_16, 'wb')

        # average PESQ over the utterances in this batch
        pesq_a /= len(clean_x)
        test_pesq += pesq_a
        counter += 1

    test_pesq /= counter
    return test_pesq

JalaJalera commented 1 year ago

Hey, I'm working with this model right now. Nothing is ready to be published yet, but I have found some problems in this implementation. The thing that will fix the bad PESQ is to use the right STFT window: here they use a rectangular window. You can pass a better-suited window like this:

clean_stft = torch.stft(input=clean_sample, n_fft=self.n_fft,
                        window=torch.hann_window(self.n_fft, True),
                        hop_length=self.hop_length, normalized=True, return_complex=False)
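
If you change the analysis window, the inverse transform has to use the same window, otherwise the reconstruction will not match. A rough sketch of the matching call (on recent torch versions the real/imag tensor from return_complex=False may need to be converted with torch.view_as_complex first; clean_rec is an illustrative name):

window = torch.hann_window(self.n_fft, True)
# .contiguous() keeps view_as_complex happy about the memory layout
clean_rec = torch.istft(torch.view_as_complex(clean_stft.contiguous()), n_fft=self.n_fft,
                        hop_length=self.hop_length, window=window, normalized=True)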

In the snippet above I used a Hann window. Some other flaws:

  1. The complex BatchNorm is not wrong, but not really right either. You can find a better implementation here: https://github.com/wavefrontshaping/complexPyTorch
  2. All the waveforms are truncated to 3.5 seconds, although some of the samples are longer, so part of the data is not used at all (see the sketch after this list)
  3. The model is trained with 48 kHz samples, although the PESQ evaluation uses only 16 kHz
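
Regarding point 2, waveforms longer than the cut-off could be split into consecutive fixed-length chunks instead of being truncated. A minimal sketch (split_into_chunks and max_len are illustrative names, not from the repo):

import torch

def split_into_chunks(waveform, max_len):
    # split a [1, n] waveform into [1, max_len] pieces, zero-padding the last piece
    # e.g. max_len = int(3.5 * 48000) for 3.5 s at 48 kHz
    chunks = list(torch.split(waveform, max_len, dim=-1))
    shortfall = max_len - chunks[-1].shape[-1]
    if shortfall > 0:
        chunks[-1] = torch.nn.functional.pad(chunks[-1], (0, shortfall))
    return chunks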

Hope this helped, and I will share my code when it's "presentable".