lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in PyTorch

Audio downsampling for MultiScaleDiscriminators #92

Closed ilya16 closed 1 year ago

ilya16 commented 1 year ago

Hi, Phil!

Currently, audio downsampling for MultiScaleDiscriminators is performed with F.interpolate(audio) using the default mode='nearest'. As a result, only a subset of the original samples appears in the downsampled tensors seen by the discriminators (e.g. only the even-indexed samples for 2x downsampling), so gradients do not propagate through the entire waveform.
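
To illustrate (a minimal repro with toy values): nearest-neighbor downsampling simply picks every second sample, so the discarded samples receive no gradient.

```python
import torch
import torch.nn.functional as F

audio = torch.arange(8, dtype=torch.float).view(1, 1, 8)  # (batch, channels, time)

# 2x downsampling with the default nearest mode keeps only the even-indexed samples
down = F.interpolate(audio, scale_factor=0.5, mode='nearest')
print(down)  # tensor([[[0., 2., 4., 6.]]])
```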

I am not sure what the best replacement is. In MelGAN- and HiFi-GAN-based vocoders, which make heavy use of MultiScaleDiscriminators, the common solution is AvgPool1d(kernel_size=4, stride=2, padding=2) for 2x downsampling.
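
For reference, a minimal sketch of that pooled alternative (the three-scale loop is illustrative, mirroring how HiFi-GAN feeds its scale discriminators):

```python
import torch
from torch import nn

# average pooling mixes every input sample into the output,
# so gradients flow through the whole waveform
downsample = nn.AvgPool1d(kernel_size=4, stride=2, padding=2)

audio = torch.randn(1, 1, 16000, requires_grad=True)
scales = [audio]
for _ in range(2):  # original, 2x and 4x downsampled scales
    scales.append(downsample(scales[-1]))
```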

lucidrains commented 1 year ago

@ilya16 this is such a keen observation! thank you!

turian commented 1 year ago

@lucidrains On this topic, I would also propose a MultiPeriodDiscriminator. To me it feels a bit more formally motivated than multi-scale discrimination.
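
For context, the core of a multi-period discriminator is reshaping the 1D waveform into a 2D grid by a fixed period before applying 2D convolutions; a minimal sketch of that reshaping step (the helper name is mine, not HiFi-GAN's):

```python
import torch
import torch.nn.functional as F

def to_period_2d(audio: torch.Tensor, period: int) -> torch.Tensor:
    """Reshape (batch, channels, time) audio into (batch, channels, time // period, period)."""
    b, c, t = audio.shape
    if t % period != 0:  # reflect-pad so the length divides the period
        pad = period - (t % period)
        audio = F.pad(audio, (0, pad), mode='reflect')
        t = t + pad
    return audio.view(b, c, t // period, period)

x = torch.randn(1, 1, 16000)
x2d = to_period_2d(x, period=5)  # HiFi-GAN uses periods (2, 3, 5, 7, 11)
print(x2d.shape)  # torch.Size([1, 1, 3200, 5])
```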

turian commented 1 year ago

https://github.com/jik876/hifi-gan/blob/4769534d45265d52a904b850da5a622601885777/models.py#L227-L230

I guess it's worth noting that BigVGAN tried all three and found that multi-scale STFT discrimination (multi-resolution discrimination, MRD) combined with MPD works better than MSD:

[two attached screenshots: discriminator comparison results from the BigVGAN paper]
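
For completeness, the MRD input is just magnitude spectrograms at several STFT resolutions, each fed to its own 2D-conv discriminator; a rough sketch (the resolution triples are illustrative, BigVGAN follows UnivNet's settings):

```python
import torch

# (n_fft, hop_length, win_length) per resolution -- illustrative values
resolutions = [(512, 128, 512), (1024, 256, 1024), (2048, 512, 2048)]

audio = torch.randn(1, 16000)  # (batch, time)
specs = []
for n_fft, hop, win in resolutions:
    stft = torch.stft(audio, n_fft=n_fft, hop_length=hop, win_length=win,
                      window=torch.hann_window(win), return_complex=True)
    specs.append(stft.abs())  # magnitude spectrogram for one sub-discriminator
```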