sh-lee-prml / BigVGAN

Unofficial PyTorch implementation of BigVGAN: A Universal Neural Vocoder with Large-Scale Training
MIT License

Why did you use low-pass filter twice in AMPBlock? #6

Open hoyden opened 2 years ago

hoyden commented 2 years ago

https://github.com/sh-lee-prml/BigVGAN/blob/37e49f36e50134de45b407bf2c6b1a61cea09329/models_bigvgan.py#L66-L69

Are these lines necessary? I didn't find them in the paper.

sh-lee-prml commented 2 years ago

I referred to Appendix A (page 13) of the BigVGAN paper.

The upsampled feature is followed by M number of AMP residual blocks, where each AMP block uses different kernel sizes for a stack of dilated 1D convolutions defined as k_i,j (j = {1, ..., M}). --> M = 3, kernel sizes [3, 7, 11]

The j-th AMP block contains L number of the anti-aliased periodic activation and the dilated 1D convolution using a dilation rate of d_i,j,l (l = {1, ..., L}). --> L = 6 (3 × 2), dilation rates [[1, 1], [3, 1], [5, 1]]

So I used the low-pass filter twice in this module.

But I'm not sure this is the same as what the authors intended. When I first implemented this part, I was also confused because the figure did not match the hyperparameters (specifically, the dilation rates).
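For illustration, here is a minimal sketch of one residual unit in the "twice" variant, where the low-pass-filtered (anti-aliased) snake activation is applied before each of the two dilated convs. This is not the exact code in this repo: the resampling is stubbed with `F.interpolate` for brevity (the real implementation uses a windowed-sinc low-pass filter), and `AntiAliasedSnake` / `AMPResidualUnit` are illustrative names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def snake(x, alpha=1.0):
    # Snake activation: x + (1/alpha) * sin^2(alpha * x)
    return x + (1.0 / alpha) * torch.sin(alpha * x) ** 2


class AntiAliasedSnake(nn.Module):
    """Anti-aliased activation: upsample -> snake -> low-pass filter + downsample.

    The resampling below is a crude placeholder (linear interpolation); the actual
    implementation uses a windowed-sinc low-pass filter.
    """
    def __init__(self, ratio=2):
        super().__init__()
        self.ratio = ratio

    def forward(self, x):  # x: (B, C, T)
        x = F.interpolate(x, scale_factor=self.ratio, mode="linear")
        x = snake(x)
        return F.interpolate(x, scale_factor=1.0 / self.ratio, mode="linear")


class AMPResidualUnit(nn.Module):
    """One dilated-conv pair where the anti-aliased activation (and hence the
    low-pass filter) is applied twice, once before each conv."""
    def __init__(self, channels, kernel_size=3, dilations=(1, 1)):
        super().__init__()
        self.act1 = AntiAliasedSnake()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilations[0],
                               padding=(kernel_size - 1) * dilations[0] // 2)
        self.act2 = AntiAliasedSnake()
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=dilations[1],
                               padding=(kernel_size - 1) * dilations[1] // 2)

    def forward(self, x):
        residual = x
        x = self.conv1(self.act1(x))  # first low-pass-filtered activation
        x = self.conv2(self.act2(x))  # second low-pass-filtered activation
        return x + residual
```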

hoyden commented 2 years ago

Thanks for your reply. But I think L = 3; it's common practice to add an extra dilated 1D convolution layer (d = 1). Maybe we should treat the two dilated convolution layers as a whole and just apply the low-pass filter at the input of the whole module. It's just my opinion ^^

sh-lee-prml commented 2 years ago

Thank you for your concern about this issue.

I agree with your idea of using the low-pass filter only once. However, in that case, I was unsure which activation function to use between the dilated convs.

The original HiFi-GAN used a leaky ReLU as the activation function in this part. However, in this paper, BigVGAN replaced it with snake1d. Hence, I implemented this part by using the snake activation and low-pass-filtered resampling twice.
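For reference, snake1d is usually implemented with a learnable per-channel alpha, roughly like the sketch below (the names and details are illustrative, not this repo's exact module); HiFi-GAN would simply use `nn.LeakyReLU(0.1)` at the same position.

```python
import torch
import torch.nn as nn


class Snake1d(nn.Module):
    """Snake activation with a learnable per-channel alpha:
    snake(x) = x + (1/alpha) * sin^2(alpha * x)."""
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))  # one alpha per channel

    def forward(self, x):  # x: (B, C, T)
        return x + (1.0 / (self.alpha + 1e-9)) * torch.sin(self.alpha * x) ** 2
```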

I just think it would have been nice if there were additional ablation studies on this part in the paper.

I'm not sure which implementation is best in this case. I hope the authors of BigVGAN address these issues with more ablation studies...

hoyden commented 2 years ago

Yeah, you're right. I'll try using snake1d directly between the two dilated convs and share the results of my experiment. Maybe someday NVIDIA will open-source their work. ^^
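For comparison with the earlier sketch, this "once" variant would apply the anti-aliased (low-pass-filtered) activation only at the input of the conv pair and use a plain snake1d between the two dilated convs. A minimal sketch, reusing the illustrative `AntiAliasedSnake` and `Snake1d` modules from the sketches above:

```python
import torch.nn as nn


class AMPResidualUnitOnce(nn.Module):
    """One dilated-conv pair with the low-pass-filtered activation applied only once,
    at the input; a plain snake1d sits between the two dilated convs."""
    def __init__(self, channels, kernel_size=3, dilations=(1, 1)):
        super().__init__()
        self.act_in = AntiAliasedSnake()   # resampling + low-pass filter, once
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilations[0],
                               padding=(kernel_size - 1) * dilations[0] // 2)
        self.act_mid = Snake1d(channels)   # plain snake1d, no resampling
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=dilations[1],
                               padding=(kernel_size - 1) * dilations[1] // 2)

    def forward(self, x):
        residual = x
        x = self.conv1(self.act_in(x))
        x = self.conv2(self.act_mid(x))
        return x + residual
```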

sh-lee-prml commented 2 years ago

After some comparison, I found that both models (using resampling once or twice) have similar performance.

But when resampling twice, training/inference speed is much slower, so I changed it to resample once, as you suggested.

Thank you~