`MultiScaleDiscriminator` differs from paper

haydenshively commented 1 year ago

Spent some time comparing this MultiScaleDiscriminator with the SoundStream paper, as well as the official MelGAN implementation (cited in SoundStream). A few small differences:

- `init_conv` uses kernel size 7, should be 15
- each of the 4 grouped convolutions has kernel size 8, while MelGAN uses 41 (`stride * 10 + 1`). This feels huge but checks out in both their code and Appendix A of the paper
- `final_conv` has 2 `Conv1d` layers with kernel size 3 and 1 respectively, while MelGAN uses 5 and 3

I don't think any of these are a big deal, but wanted to share for the sake of completeness.
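
For reference, the MelGAN-side kernel sizes listed above can be tabulated and sanity-checked with a small, dependency-free sketch. It only encodes what the list states (15 for `init_conv`, `stride * 10 + 1` for the 4 grouped convolutions, 5 and 3 for `final_conv`). The stride of 4 is inferred from 41 = 4 * 10 + 1, and the symmetric padding of `(kernel_size - 1) // 2` is an assumption, as are the names `ConvSpec`, `melgan_style_specs`, and `trace_lengths`, which are hypothetical helpers and not part of audiolm-pytorch or MelGAN. Channel widths and group counts are left out because the thread does not state them.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ConvSpec:
    """Hypothetical record describing one 1D convolution in the discriminator."""
    name: str
    kernel_size: int
    stride: int = 1

    @property
    def padding(self) -> int:
        # Assumption: symmetric padding of (kernel_size - 1) // 2 on each side.
        return (self.kernel_size - 1) // 2

    def out_length(self, length: int) -> int:
        # Standard 1D convolution output-length formula with dilation 1.
        return (length + 2 * self.padding - self.kernel_size) // self.stride + 1


def melgan_style_specs(stride: int = 4, num_grouped: int = 4) -> list[ConvSpec]:
    """Layer list using the kernel sizes quoted in the issue."""
    specs = [ConvSpec("init_conv", 15)]
    specs += [
        ConvSpec(f"grouped_conv_{i}", stride * 10 + 1, stride)
        for i in range(num_grouped)
    ]
    specs += [ConvSpec("final_conv_0", 5), ConvSpec("final_conv_1", 3)]
    return specs


def trace_lengths(length: int, specs: list[ConvSpec]) -> list[int]:
    """Sequence length after each layer, in order."""
    lengths = []
    for spec in specs:
        length = spec.out_length(length)
        lengths.append(length)
    return lengths


if __name__ == "__main__":
    specs = melgan_style_specs()
    for spec, n in zip(specs, trace_lengths(16000, specs)):
        print(f"{spec.name:15s} k={spec.kernel_size:2d} s={spec.stride} -> {n}")
```

Under these assumptions, each grouped convolution downsamples the sequence by roughly its stride, while `init_conv` and the two `final_conv` layers preserve length.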

lucidrains commented 1 year ago

hey Hayden, thanks for raising this

why are you comparing the discriminator with the one from MelGAN? Soundstream has no relationship with that paper afaict?

haydenshively commented 1 year ago

SoundStream, Section III.D:

> For the wave-based discriminator, we use the same multiresolution convolutional discriminator proposed in [15] and adopted in [45]

[15] is MelGAN and [45] is SEANet. SEANet refers readers back to MelGAN for discriminator architecture details, so I went with that.

lucidrains commented 1 year ago

@haydenshively i believe you are right

thank you! i've updated it in 1.1.0; do let me know if you see any other discrepancies

lucidrains / audiolm-pytorch

`MultiScaleDiscriminator` differs from paper #194