@r9y9 I was trying to understand the wavenet-vocoder implementation and some of the layer dimensions didn't seem to match based on what I understood from the wavenet paper.
Could you shed some light on these dimensions? Maybe I'm missing something.
1) The first_conv layer is printed with shape 1x512. Isn't the input to this layer the mel-spectrogram frames, i.e. 80 float values x 2,500 frames (2,500 being the maximum number of decoder steps, defined as max_iters in hparams.py)? If so, shouldn't in_channels for this conv1d layer be 80 instead of 1? And why is out_channels 512?
2) Isn't the input to the wavenet a sequence of mel frames of 80 floats each? Why is input_type set to input_type="raw"?
3) Why is in_channels 80 for (conv1x1c) below and 256 for (conv1x1_out)? There doesn't seem to be anything generating 256-d inputs for (conv1x1_out). What exactly are their inputs (e.g. are they mel-spectrogram frames)? Is the wavenet-vocoder generating just 1 float value per input mel-spectrogram frame of 80 floats?
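To make the mismatch in question 1 concrete, here is a minimal sketch in plain Python of the shape arithmetic I had in mind. The helper names are hypothetical (they are not from the repo), and kernel_size=1 for first_conv is my assumption. The only fact relied on is that PyTorch stores Conv1d weights as (out_channels, in_channels, kernel_size) when groups=1.

```python
# Hypothetical helpers, not part of wavenet-vocoder: they only do shape arithmetic.

def conv1d_weight_shape(in_channels, out_channels, kernel_size=1):
    """Weight shape of a Conv1d layer with groups=1, in PyTorch's layout."""
    return (out_channels, in_channels, kernel_size)


def conv1d_output_shape(batch, in_len, out_channels, kernel_size=1):
    """Output shape for stride 1, no padding, dilation 1."""
    return (batch, out_channels, in_len - kernel_size + 1)


MEL_DIM = 80        # floats per mel frame
MAX_FRAMES = 2500   # max_iters in hparams.py

# What I expected if first_conv consumed mel frames directly:
expected = conv1d_weight_shape(in_channels=MEL_DIM, out_channels=512)

# What the printed model suggests (in_channels=1, out_channels=512):
printed = conv1d_weight_shape(in_channels=1, out_channels=512)

print("expected weight shape:", expected)
print("printed weight shape: ", printed)
print("output shape for a (1, 80, 2500) input:",
      conv1d_output_shape(1, MAX_FRAMES, 512))
```

With in_channels=1 the layer could only accept a single-channel sequence, which is why I don't see how 80-dimensional mel frames would be fed into it.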
The whole network prints as follows: