Closed by ilya16 1 year ago
@ilya16 this is such a keen observation! thank you!
@lucidrains On this topic, I would propose a MultiPeriodDiscriminator too. To me it is a little more formally motivated than multiscale.
https://github.com/jik876/hifi-gan/blob/4769534d45265d52a904b850da5a622601885777/models.py#L227-L230
I guess it's worth noting that BigVGAN tried all three and found that multi-scale STFT discrimination (multi-resolution discrimination, MRD) and MPD work better than MSD:
Hi, Phil!
Currently, audio downsampling for `MultiScaleDiscriminators` is performed using `F.interpolate(audio)` with the default `mode = 'nearest'`. This means that only a subset of the original audio values appears in the downsampled audio tensors used by the discriminators (e.g. only the even-indexed samples for 2-times downsampling). Thus, the gradients do not propagate through the entire audio.
I am not sure what the best replacement is. In MelGAN and HiFiGAN-based vocoders, which widely use `MultiScaleDiscriminators`, the common solution is `AvgPool1d(kernel_size=4, stride=2, padding=2)` for 2-times downsampling.
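To make the difference concrete, here is a minimal sketch that compares the two downsampling options on a toy tensor. The 8-sample input, the variable names, and `scale_factor=0.5` are assumptions chosen for illustration and are not taken from the repository code. It only checks which input samples receive a gradient.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Toy mono "audio" of 8 samples, shaped (batch, channels, time).
# The values 0..7 make it easy to see which samples are kept.
audio = torch.arange(8, dtype=torch.float32).reshape(1, 1, 8)

# Option 1: F.interpolate with its default mode='nearest'.
x_nearest = audio.clone().requires_grad_(True)
down_nearest = F.interpolate(x_nearest, scale_factor=0.5)
down_nearest.sum().backward()

# Option 2: average pooling with the MelGAN / HiFi-GAN style settings.
x_pool = audio.clone().requires_grad_(True)
pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=2)
down_pool = pool(x_pool)
down_pool.sum().backward()

nearest_grad = x_nearest.grad.flatten()
pool_grad = x_pool.grad.flatten()

print(down_nearest.flatten())  # only the even-indexed samples are kept
print(nearest_grad)            # zero gradient at every odd index
print(pool_grad)               # non-zero gradient at every index
```

With nearest-neighbour downsampling, half of the input samples get a zero gradient, while average pooling spreads the gradient over all of them. Note that with these pooling settings the output has `T // 2 + 1` frames for an even input length `T` (5 here instead of 4), so the lengths of the two variants differ by one.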