tuan3w / spleeter-pytorch

Spleeter implementation in pytorch
MIT License

difference reason #2

Open generalwave opened 4 years ago

generalwave commented 4 years ago

Two differences:

1. BatchNorm: `nn.BatchNorm2d(out_channels, eps=1e-3, momentum=0.01)`
2. Padding: PyTorch pads left/top first, while TensorFlow/Keras pads right/bottom first.

```python
from math import ceil, floor

import torch.nn as nn
from torch.nn import functional


class Conv2dKeras(nn.Conv2d):
    """Conv2d with TensorFlow/Keras-style 'same' padding."""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1,
                 padding='same', dilation=1, groups=1, bias=True,
                 padding_mode='zeros'):
        super(Conv2dKeras, self).__init__(
            in_channels, out_channels, kernel_size, stride, 0,
            dilation, groups, bias, padding_mode)
        self.keras_mode = padding

    def _padding_size(self, size, idx):
        # Total padding needed along one spatial dimension for 'same' output.
        output = (size[idx] + self.stride[idx] - 1) // self.stride[idx]
        padding = ((output - 1) * self.stride[idx]
                   + (self.kernel_size[idx] - 1) * self.dilation[idx]
                   + 1 - size[idx])
        return max(0, padding)

    def forward(self, x):
        if self.keras_mode == 'same':
            size = x.shape[2:]
            row = self._padding_size(size, 0)
            col = self._padding_size(size, 1)
            # Put the extra pixel on the right/bottom, as Keras does.
            x = functional.pad(x, [floor(col / 2), ceil(col / 2),
                                   floor(row / 2), ceil(row / 2)])
        return super(Conv2dKeras, self).forward(x)
```
james34602 commented 4 years ago

@tuan3w The only fatal error in your implementation is the concatenation. https://github.com/deezer/spleeter/blob/39af9502ab1156c013f17f8d8cd1c53d46459857/spleeter/model/functions/unet.py#L127 Each U-Net encoder convolutional layer's output is concatenated with the decoder output; the encoder's batch norm / activation output is not what gets concatenated.
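A minimal sketch of that skip connection, assuming illustrative channel counts and sizes: the raw conv output is saved for concatenation, while BN + activation only feed the next encoder layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical encoder step: keep the raw conv output for the skip
# connection; BN + activation feed the next layer only.
conv = nn.Conv2d(1, 16, kernel_size=5, stride=2, padding=2)
bn = nn.BatchNorm2d(16, eps=1e-3, momentum=0.01)

x = torch.randn(1, 1, 64, 64)
skip = conv(x)                         # saved for concatenation later
down = F.leaky_relu(bn(skip), 0.2)     # fed to the next encoder layer

# Later, in the decoder, once upsampled back to the skip's size:
up = torch.randn(1, 16, 32, 32)        # placeholder decoder activation
merged = torch.cat([up, skip], dim=1)  # concat the conv output, not bn/act
```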

Minor issue to solve:

  1. Batch normalization epsilon is set to 1e-3.
  2. The leaky ReLU alpha is 0.2 in official Spleeter, not 0.3.
  3. The 4-stems model changes all encoder and decoder activations to ELU (exponential linear unit).
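The three settings above, written out as a sketch (the channel count is illustrative):

```python
import torch.nn as nn

bn = nn.BatchNorm2d(16, eps=1e-3, momentum=0.01)  # eps 1e-3, not the 1e-5 default
act_2stem = nn.LeakyReLU(0.2)  # alpha = 0.2, not 0.3
act_4stem = nn.ELU()           # 4-stems model uses ELU instead
```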

Here is my implementation of Spleeter in C, verified correct, with a VST demo: https://github.com/james34602/SpleeterRT/blob/master/Source/spleeter.c

@generalwave I don't think the problem is the CNN padding, is it?

tuan3w commented 4 years ago

Thanks @james34602 and @generalwave.

The quality of the output seems better now. However, I still see some differences in the output waveform. I'm not sure whether they are due to a bug or to differences in the preprocessing step.

james34602 commented 4 years ago

@tuan3w What's the MSE/MAE of the output mask between your implementation and the official Spleeter (TensorFlow)? If the masks are identical or close (within ~1e-3), then your implementation is correct. You don't have to worry about differences caused by minor processing.
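Such a check could look like the following sketch; `mask_pt` and `mask_tf` are stand-ins for the soft masks produced by the PyTorch port and the official TensorFlow Spleeter on the same spectrogram.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two masks (same shape, float arrays).
mask_pt = rng.random((512, 128), dtype=np.float32)
mask_tf = mask_pt + np.float32(1e-4)  # pretend the port is near-identical

mse = float(np.mean((mask_pt - mask_tf) ** 2))
mae = float(np.mean(np.abs(mask_pt - mask_tf)))

# A port is effectively correct when MAE is on the order of 1e-3 or less.
```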

tuan3w commented 4 years ago

Hi @james34602 , here are the spectrograms of the output audio. [image]

The top one is from my implementation, the bottom is from Spleeter. As you can see, the audio generated by Spleeter seems to have less noise at high frequencies than mine.

james34602 commented 4 years ago

I've been busy with my own projects recently; maybe I can help you find the remaining bugs in the future.

generalwave commented 4 years ago

@james34602 The padding difference between PyTorch and TensorFlow actually has a significant impact. It's fine if you train from scratch, but for a model obtained by weight conversion the padding must follow the original: the padding in both the CNN and transposed-CNN layers differs from PyTorch's, and both need to be changed. For reference, I just pushed my PyTorch implementation; the training and inference parts differ slightly from the original files. https://github.com/generalwave/spleeter.pytorch

james34602 commented 4 years ago

@generalwave In my experience, TensorFlow's and Matlab's padding are almost identical. As for differences between PyTorch and TensorFlow, I don't know of any special case other than padding='same'. I have converted the CNN weights of a PyTorch SRGAN to Matlab, and the two predicted identical results. Even if TensorFlow's and PyTorch's padding differ, in theory it can be fully resolved by zero-padding the input beforehand. Spleeter's official training set is not public, so training from scratch and matching the original paper's results is impossible.

generalwave commented 4 years ago

The PyTorch/Matlab agreement is probably because the image size and padding happened to line up. With an asymmetric kernel, the zero-padding is not the same. I'm not saying the model parameters have to match Spleeter's; I mean that with PyTorch's padding scheme you need to train from scratch, and the results can then match Spleeter's.
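The asymmetry point can be shown with toy numbers: with an even kernel, TF-style 'same' puts the one extra zero on the right, which PyTorch's symmetric `padding=` argument cannot express.

```python
import torch
import torch.nn.functional as F

x = torch.arange(1., 6.).view(1, 1, 1, 5)  # width 5: [1, 2, 3, 4, 5]
w = torch.ones(1, 1, 1, 2)                 # even (asymmetric) kernel, width 2

tf_like = F.conv2d(F.pad(x, [0, 1, 0, 0]), w)  # pad right only -> width 5
pt_sym = F.conv2d(x, w, padding=(0, 1))        # pads both sides -> width 6
```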

james34602 commented 4 years ago

The padding in that SRGAN probably happened to make the input and output the same size, so it matched Matlab's 'same' and the results agreed. Personally, I've had no problem implementing TF or PyTorch CNNs in C: set stride, padding, dilation, and offset correctly, then feed everything to im2col() and gemm() and you're done.
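The im2col + GEMM pipeline mentioned above, sketched in NumPy for a single channel with stride 1 and no padding (function names are illustrative, not the ones in SpleeterRT):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold kh-by-kw patches of a 2-D array into columns."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, oh * ow), dtype=x.dtype)
    idx = 0
    for i in range(oh):
        for j in range(ow):
            cols[:, idx] = x[i:i + kh, j:j + kw].ravel()
            idx += 1
    return cols, (oh, ow)

x = np.arange(16, dtype=np.float64).reshape(4, 4)
k = np.ones((2, 2))                     # box filter as the kernel
cols, (oh, ow) = im2col(x, 2, 2)
y = (k.ravel() @ cols).reshape(oh, ow)  # one GEMM performs the convolution
```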