About f0 upsampling - GitHub issues

About upsampling, see the comments below.
Here, "time step" denotes the waveform time step t.

All hidden features and signals in the source and filter modules have the same length as the waveform o_1:T. However, F0 and spectral features are extracted once per frame, so they have only B frames (i.e., f_1:B). The condition module therefore needs to upsample f_1:B to \tilde{f}_1:T.

Upsampling is quite straightforward: each value f_b is simply copied multiple times. Suppose the waveform sampling rate is 16 kHz and the frame shift is 5 ms (one frame every 5 ms). Then each f_b must be replicated 16 * 5 = 80 times (16 samples per ms, 5 ms per frame).
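
A minimal NumPy sketch of this replication, assuming the 16 kHz sampling rate and 5 ms frame shift from the example above. The function name `upsample_by_repetition` and its parameters are illustrative and are not taken from the CURRENNT code base.

```python
import numpy as np

def upsample_by_repetition(f_frames, sampling_rate_hz=16000, frame_shift_ms=5):
    """Repeat each frame-level value so there is one value per waveform sample."""
    # Samples per frame: 16000 samples/s * 5 ms / 1000 ms/s = 80
    samples_per_frame = sampling_rate_hz * frame_shift_ms // 1000
    # np.repeat copies each element samples_per_frame times, in order
    return np.repeat(np.asarray(f_frames), samples_per_frame)

# Three frames of made-up F0 values (0.0 standing in for an unvoiced frame)
f0 = np.array([100.0, 110.0, 0.0])
f0_up = upsample_by_repetition(f0)
print(f0_up.shape)  # (240,)
```

With B = 3 frames, the upsampled sequence has T = 3 * 80 = 240 values.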

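Upsampling in math: \tilde{f}_t = f_{\lfloor t/80 \rfloor}, where \lfloor t/80 \rfloor is floor division (e.g., \lfloor 2/3 \rfloor = 0, \lfloor 4/3 \rfloor = 1). This form assumes indices counted from 0; with the 1-based indices of f_1:B, the frame index is \lceil t/80 \rceil.

Upsampling in picture: page 13 of https://www.slideshare.net/akiratamamori/speaker-dependent-wavenet-vocoder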
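
The floor-division mapping can also be written directly as an index lookup. The sketch below assumes 0-based indices (so sample t reads frame t // 80). The helper name `upsample_by_index`, its parameters, and the clamping of out-of-range indices are assumptions made for illustration, not the toolkit's implementation.

```python
import numpy as np

def upsample_by_index(f_frames, num_samples, samples_per_frame=80):
    """Look up the frame value for each sample index t via floor division."""
    f_frames = np.asarray(f_frames)
    t = np.arange(num_samples)            # sample indices 0 .. T-1
    frame_index = t // samples_per_frame  # floor division, as in the formula
    # Assumption: if T exceeds B * samples_per_frame, reuse the last frame
    frame_index = np.minimum(frame_index, len(f_frames) - 1)
    return f_frames[frame_index]

f0 = np.array([100.0, 110.0, 120.0])
f0_up = upsample_by_index(f0, num_samples=240)

# When T == B * 80, the lookup matches plain repetition of each frame value
same_as_repeat = np.array_equal(f0_up, np.repeat(f0, 80))
```

Writing the mapping as an index lookup makes it easy to handle a waveform length T that is not an exact multiple of the frame shift, which plain repetition does not cover.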

(I realize that "duplicating to every time step within the b-th frame" may be misleading if we consider the overlap of the framing window.)

nii-yamagishilab / project-CURRENNT-public

About f0 upsampling #3