Closed haydenshively closed 1 year ago
hey Hayden, thanks for raising this
why are you comparing the discriminator with the one from MelGAN? SoundStream has no relationship with that paper afaict?
SoundStream Section III.D:
> For the wave-based discriminator, we use the same multiresolution convolutional discriminator proposed in [15] and adopted in [45]
[15] is MelGAN and [45] is SEANet. SEANet refers readers back to MelGAN for discriminator architecture details, so I went with that.
@haydenshively i believe you are right
thank you! i've updated it in 1.1.0; do let me know if you see any other discrepancies
Spent some time comparing this `MultiScaleDiscriminator` with the SoundStream paper, as well as the official MelGAN implementation (cited in SoundStream). A few small differences:

- `init_conv` uses kernel size 7, should be 15
- […] `stride * 10 + 1`). This feels huge but checks out in both their code and Appendix A of the paper
- `final_conv` has 2 Conv1d layers with kernel size 3 and 1 respectively, while MelGAN uses 5 and 3

I don't think any of these are a big deal, but wanted to share for the sake of completeness.
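
For anyone who wants to sanity-check the `stride * 10 + 1` kernel size, here is a minimal sketch of the arithmetic in plain Python. It only uses the standard 1-D convolution output-length formula. The stride of 4, the padding of `stride * 5`, the four strided layers, and the 16000-sample input are assumptions made for illustration (the first three are my reading of the MelGAN defaults, not something stated in this thread), and the helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0):
    """Output length of a 1-D convolution with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1


def strided_layer_params(stride):
    """Kernel size and padding under the `stride * 10 + 1` convention.

    padding = stride * 5 is an assumption for this sketch.
    """
    return {"kernel_size": stride * 10 + 1, "stride": stride, "padding": stride * 5}


stride = 4  # assumed downsampling factor
params = strided_layer_params(stride)

length = 16000  # hypothetical input: 1 second of audio at 16 kHz
lengths = [length]
for _ in range(4):  # assumed number of strided layers
    length = conv1d_out_len(length, **params)
    lengths.append(length)

# An initial conv with kernel size 15 and padding 7 keeps the length unchanged.
init_len = conv1d_out_len(16000, kernel_size=15, stride=1, padding=7)

print(params)    # kernel size and padding for the strided layers
print(lengths)   # sequence length after each strided layer
print(init_len)  # length after the initial conv
```

With stride 4 the kernel comes out at 41 samples, which is why it looks huge next to the other layers; each such layer still only shrinks the sequence by roughly the stride.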