kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License

Unclear signal flow related to usage of mel spectrograms in StyleMelGAN #384

Open andrewrose43 opened 1 year ago

andrewrose43 commented 1 year ago

Hello,

This is probably just a documentation problem.

It is unclear how mel spectrograms are used by the StyleMelGAN generator module.

I've been trying to figure out how to format mel spectrograms so the generator will accept them. To figure that out, I've been looking at the initialization parameters of the StyleMelGANGenerator module.

The only obvious candidate for defining the format/dimensions of the input spectrogram is the aux_channels parameter. But that wouldn't make sense, for these reasons:

1) Its default value is 80, but a mel spectrogram contains much more than 80 points of data.
2) aux_channels controls only one parameter: the in_channels parameter of the first layer in the first TADEResBlock. That would make sense if the mel spectrograms' dimensions corresponded to this parameter, but...
3) The diagram of StyleMelGAN's signal path in the original StyleMelGAN paper conflicts with point 2); the diagram shows the spectrograms being inserted into every TADEResBlock, not just the first.

So my questions are:

  1. What is aux_channels? (What kind of data is considered "auxiliary input" - am I correct that this is the spectrograms?)
  2. If aux_channels does not determine how the input spectrograms should be formatted, what does?

If you can answer these questions for me, I would be happy to improve the documentation/comments myself.

Thank you!

kan-bayashi commented 1 year ago

> What is aux_channels?

It is the dimension of the auxiliary input, i.e., the mel-spectrogram.

> If aux_channels does not determine how the input spectrograms should be formatted, what does?

I am not sure I understand what you mean. This parameter determines the dimension of the mel-spectrogram.

> Its default value is 80, but a mel spectrogram contains much more than 80 points of data.

You may be confusing the shape of the mel-spectrogram. Its shape is (#frames, #dim), and aux_channels corresponds to #dim.
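
To make the shape point concrete, here is a minimal sketch of the arithmetic. It is not code from this repository: `mel_shape` is a hypothetical helper, the sample rate and hop size are illustrative assumptions, and the frame count assumes a centered STFT (one frame per hop, plus one). It only shows that the total number of values grows with the audio length while the per-frame dimension stays at 80.

```python
from typing import Tuple


def mel_shape(num_samples: int, hop_size: int, num_mels: int) -> Tuple[int, int]:
    """Return (#frames, #dim) of a mel-spectrogram.

    Hypothetical helper. Assumes a centered STFT, which yields
    one frame per hop plus one extra frame.
    """
    num_frames = num_samples // hop_size + 1
    return num_frames, num_mels


# Illustrative values (assumptions, not taken from this thread).
SAMPLE_RATE = 24000   # samples per second
HOP_SIZE = 300        # samples between consecutive frames
AUX_CHANNELS = 80     # the default value mentioned in the question

# One second of audio.
frames, dim = mel_shape(SAMPLE_RATE * 1, HOP_SIZE, AUX_CHANNELS)

print((frames, dim))  # (81, 80): 81 frames, each an 80-dimensional vector
print(frames * dim)   # 6480 values in total, but #dim is still 80
```

So under these assumptions a one-second clip holds 6480 values, yet the quantity that aux_channels has to match is only the second axis, #dim = 80.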