Closed zhao2zhang closed 5 years ago
All the hidden features and signals in the source and filter modules have the same length as the waveform o_1:T. However, F0 and spectral features are extracted once per frame, so there are only B frames of them (i.e., f_1:B). The condition module therefore needs to upsample f_1:B to \tilde{f}_1:T.
Upsampling is quite straightforward: just copy the value of f_b multiple times. Suppose the waveform sampling rate is 16 kHz and the frame shift is 5 ms (one frame every 5 ms). Since 16 kHz is 16 samples per millisecond, each f_b must be replicated 16 * 5 = 80 times.
Upsampling in math: \tilde{f}_t = f_{\lfloor t/80 \rfloor}, where \lfloor t/80 \rfloor is floor division (e.g., \lfloor 2/3 \rfloor = 0, \lfloor 4/3 \rfloor = 1).
Upsampling in picture: page 13 of https://www.slideshare.net/akiratamamori/speaker-dependent-wavenet-vocoder
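For illustration, here is a minimal NumPy sketch of the replication rule described above. It is not taken from the paper's code: the function name `upsample_by_repetition`, its parameter names, and the F0 values are made up for this example, and it assumes zero-based indices for both t and b.

```python
import numpy as np


def upsample_by_repetition(frames, sampling_rate_hz=16000, frame_shift_ms=5):
    """Repeat each frame-level value so there is one value per waveform sample.

    Hypothetical helper for illustration only (not from the paper's code).
    """
    # Samples per frame shift: 16000 Hz * 5 ms / 1000 = 80.
    hop = sampling_rate_hz * frame_shift_ms // 1000
    # np.repeat copies each element `hop` times along the frame axis.
    return np.repeat(np.asarray(frames), hop, axis=0)


# Made-up F0 values for B = 3 frames (0.0 marks an unvoiced frame).
f0 = np.array([100.0, 110.0, 0.0])
f0_up = upsample_by_repetition(f0)

print(f0_up.shape)           # (240,) because T = B * 80 = 240
print(f0_up[79], f0_up[80])  # 100.0 110.0: the value changes at the frame boundary

# With zero-based indices, sample t takes the value of frame t // 80.
assert all(f0_up[t] == f0[t // 80] for t in range(len(f0_up)))
```

So a "time step" here is one waveform sample, and every sample inside the same 5 ms hop receives the same frame-level value.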
(I realize that "duplicating to every time step within the b-th frame" may be misleading if we consider the overlap of the framing window.)
The paper says 'condition module upsamples the F0 by duplicating f_b to every time step within the b-th frame'. Can you explain in detail how the upsampling here is done? What is the time step? Thank you very much.