mmorise / World

A high-quality speech analysis, manipulation and synthesis system
http://www.kisc.meiji.ac.jp/~mmorise/world/english

Doubt: what is the relation of timestep to the first dimension of ap and sp #66

Closed Edresson closed 6 years ago

Edresson commented 6 years ago

Hello, I'm trying to design a neural network that predicts the f0, sp, and ap parameters for the World vocoder from a Mel spectrogram. To design it, I need to know how the first dimension of the Mel extraction relates to the first dimension of sp and ap. Below are the feature shapes (f0, sp, ap, and Mel spectrogram) extracted from 4 different audio files.

Sample 1:

Mel spectrogram shape = (446, 80)

f0 shape = (1112,) ap shape = (1112, 513) sp shape = (1112, 513)

Sample 2:

Mel spectrogram shape = (375, 80)

f0 shape = (935,) ap shape = (935, 513) sp shape = (935, 513)

Sample 3:

Mel spectrogram shape = (338, 80)

f0 shape = (842,) ap shape = (842, 513) sp shape = (842, 513)

Sample 4:

Mel spectrogram shape = (781, 80)

f0 shape = (1948,) ap shape = (1948, 513) sp shape = (1948, 513)

Do you know how the first dimension of ap and sp relates to the first dimension of the Mel spectrogram?

Example: in sample 1, what is the relationship between dimension 446 (Mel spectrogram) and 1112 (ap, sp)?

I know that here may not be the best place for the question, but this information is extremely important to me.

My idea is to predict f0, ap, and sp from the Mel spectrogram, which would make it much easier to integrate the World vocoder with Text-to-Speech models.

Thank you in advance for your attention.

lmaxwell commented 6 years ago

What frame shift did you use to extract the Mel spectrogram features?

    1112/446 = 2.493273542600897
    935/375  = 2.493333333333333
    1948/781 = 2.4942381562099873

So I guess it is 5 ms * 2.5 = 12.5 ms. When you extract the Mel spectrogram and World features, a few frames at the end of each sentence may be dropped.

To train your network, you just need to remove (suggested) or zero-pad frames so that the ratio between the first dimensions of the two is exactly 2.5.

For example:

Sample 1: 1110 (World), 444 (Mel)

Sample 2: 935 (World), 374 (Mel)
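The trimming suggested above can be sketched roughly as follows. This is only an illustration, assuming the 2.5 ratio holds (5 World frames for every 2 Mel frames) and that all features are NumPy arrays with time on the first axis. The function names `aligned_lengths` and `trim_to_ratio` are made up for this example and are not part of World or any library.

```python
import numpy as np


def aligned_lengths(n_mel, n_world):
    """Largest (mel, world) frame counts with world == 2.5 * mel.

    A ratio of 2.5 means every 2 Mel frames pair with 5 World frames,
    so we count how many complete 2:5 groups fit in both sequences.
    """
    groups = min(n_mel // 2, n_world // 5)
    return 2 * groups, 5 * groups


def trim_to_ratio(mel, f0, sp, ap):
    """Drop trailing frames so the World features are 2.5x the Mel length."""
    n_mel, n_world = aligned_lengths(mel.shape[0], f0.shape[0])
    return mel[:n_mel], f0[:n_world], sp[:n_world], ap[:n_world]


if __name__ == "__main__":
    # Shapes taken from Sample 1 in the question (contents are dummy zeros).
    mel = np.zeros((446, 80))
    f0 = np.zeros(1112)
    sp = np.zeros((1112, 513))
    ap = np.zeros((1112, 513))
    mel_t, f0_t, sp_t, ap_t = trim_to_ratio(mel, f0, sp, ap)
    print(mel_t.shape, f0_t.shape, sp_t.shape, ap_t.shape)
```

With the shapes from Sample 1 and Sample 2 this gives the same counts as in the example above (444 and 1110, 374 and 935).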

Edresson commented 6 years ago

Hello, thanks for the suggestion. The frame shift is 0.0125 seconds. Below are the settings used to extract the Mel spectrogram features:

    sr = 22050  # Sampling rate.
    n_fft = 2048  # fft points (samples)
    frame_shift = 0.0125  # seconds
    frame_length = 0.05  # seconds
    hop_length = int(sr * frame_shift)  # samples. = 275
    win_length = int(sr * frame_length)  # samples. = 1102
    n_mels = 80  # Number of Mel banks to generate
    power = 1.5  # Exponent for amplifying the predicted magnitude
    n_iter = 50  # Number of inversion iterations
    preemphasis = .97
    max_db = 100
    ref_db = 20
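As a rough check of these numbers, the arithmetic below compares the effective Mel hop with the World frame period. It assumes World was run with a 5 ms frame period, which is the guess made earlier in the thread and not something confirmed by the settings above. Because `int()` truncates 22050 * 0.0125 = 275.625 down to 275 samples, the effective hop is slightly shorter than 12.5 ms, which may be why the measured ratios sit a little below 2.5.

```python
sr = 22050                    # sampling rate from the settings above
frame_shift = 0.0125          # seconds, from the settings above
world_frame_period = 0.005    # seconds; ASSUMPTION: World run with a 5 ms period

# int() truncates 275.625 to 275 samples.
hop_length = int(sr * frame_shift)

# Effective Mel hop in seconds after truncation to whole samples.
mel_hop_seconds = hop_length / sr

# Expected number of World frames per Mel frame.
ratio = mel_hop_seconds / world_frame_period

print("hop_length:", hop_length)
print("World frames per Mel frame:", round(ratio, 4))
```

Under that assumption the ratio comes out at roughly 2.494, close to the 2.4933 to 2.4942 observed in the four samples.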

I did not want to remove or zero-pad the World features, but the best solution I found was to zero-pad the World features and also trim the Mel spectrogram. After predicting the zero-padded World features, I "unpad" them and then synthesize normally. Thanks for your help.

Thank you so much.
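The pad and "unpad" approach described in the last comment could look something like this. It is a minimal sketch, not the code actually used in the thread: it assumes the 2.5 ratio, NumPy arrays with time on the first axis, and World features that are never longer than 2.5 times the (even) Mel length. `pad_world_to_mel` and `unpad` are hypothetical helper names.

```python
import numpy as np


def pad_world_to_mel(mel, world_feat):
    """Trim Mel to an even length and zero-pad a World feature to 2.5x that.

    Returns the trimmed Mel, the padded feature, and the feature's original
    length so the padding can be removed after prediction.
    """
    n_mel = mel.shape[0] - (mel.shape[0] % 2)   # drop one frame if odd
    target = n_mel * 5 // 2                     # 2.5 World frames per Mel frame
    original_len = world_feat.shape[0]
    if original_len > target:
        raise ValueError("World feature is longer than the padded target")
    # Pad only at the end of the time axis; leave other axes untouched.
    pad_width = [(0, target - original_len)] + [(0, 0)] * (world_feat.ndim - 1)
    padded = np.pad(world_feat, pad_width, mode="constant")
    return mel[:n_mel], padded, original_len


def unpad(world_feat, original_len):
    """Remove the trailing zero frames added by pad_world_to_mel."""
    return world_feat[:original_len]


if __name__ == "__main__":
    # Shapes from Sample 1 (contents are dummy values).
    mel = np.zeros((446, 80))
    sp = np.ones((1112, 513))
    mel_t, sp_padded, n = pad_world_to_mel(mel, sp)
    print(mel_t.shape, sp_padded.shape, unpad(sp_padded, n).shape)
```

The original length has to be stored alongside each training example, since it cannot be recovered from the padded array alone.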