mmorise / World

A high-quality speech analysis, manipulation and synthesis system
http://www.kisc.meiji.ac.jp/~mmorise/world/english

Are any generated samples available? #76

Closed by mrgloom 5 years ago

mrgloom commented 5 years ago

Are any generated samples available?

mmorise commented 5 years ago

I'm afraid I did not quite follow your question. What do you mean by "samples": a generated executable file (e.g. an .exe for Windows) or a generated waveform? In either case, I think you can generate them yourself from the source code.

If I have misunderstood, please give me more details.

mrgloom commented 5 years ago

By generated samples I mean audio samples in .wav format; for example, here are comparison samples of different vocoders.

If there are no samples, can you point me to some sort of demo or example showing how to generate them?

Also, what data does the WORLD vocoder take as input? Here in the paper they wrote: "we employ the WORLD (Morise et al., 2016) vocoder (D4C edition) for feature extraction and waveform synthesis." As I understand it, the dimension of the vocoder feature vector is 63, so do they use 63 features computed by the Merlin toolkit to generate audio with the WORLD vocoder? Can you shed some light on what these 63 features are? Can WORLD use different feature vectors to generate audio, or is it always a 63-dim feature vector?

mmorise commented 5 years ago

I provide an example that generates a waveform from an input file (test/test.cpp). There are no pre-generated speech samples, but you can generate similar ones yourself with the provided source code.

WORLD requires three speech parameters to generate a waveform: fundamental frequency (F0), spectral envelope (SP), and aperiodicity (AP). For full-band speech (fs above 40 kHz), WORLD uses a 2051-dim feature vector per frame (1 for F0, 1025 for SP, and 1025 for AP). With an encoder/decoder, you can instead use coded speech parameters, such as the 63-dim feature vector you mention. I also provide my own codec example for SP and AP, which you can use to reduce the number of dimensions. Based on my study (https://www.isca-speech.org/archive/Interspeech_2017/pdfs/0067.PDF), I recommend a 56-dim vector (1, 50, and 5 dimensions for F0, SP, and AP, respectively).
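To make the dimension counts concrete, here is a minimal Python sketch of the arithmetic. It is not WORLD's API: the function names are hypothetical, and the formulas are assumptions modeled on how the WORLD source appears to choose its sizes (an FFT size that is a power of two covering three periods of a 71 Hz F0 floor, and one coded aperiodicity band per 3 kHz up to 15 kHz). Please check them against the source before relying on them.

```python
import math

# Assumption: default F0 floor used when choosing the spectral-envelope FFT size.
F0_FLOOR_HZ = 71.0


def spectral_fft_size(fs, f0_floor=F0_FLOOR_HZ):
    """Assumed rule: power of two covering three periods of the F0 floor."""
    return 2 ** (1 + int(math.log2(3.0 * fs / f0_floor + 1.0)))


def raw_frame_dims(fs):
    """Per-frame dimensions of the uncoded parameters (F0, SP, AP)."""
    bins = spectral_fft_size(fs) // 2 + 1  # one-sided spectrum length
    return {"f0": 1, "sp": bins, "ap": bins, "total": 1 + 2 * bins}


def coded_ap_dims(fs, upper_hz=15000.0, interval_hz=3000.0):
    """Assumed rule: one coded aperiodicity value per 3 kHz band, up to 15 kHz."""
    return int(min(upper_hz, fs / 2.0 - interval_hz) / interval_hz)


def coded_frame_dims(fs, sp_dims=50):
    """Per-frame dimensions after coding SP to sp_dims and AP to band values."""
    ap = coded_ap_dims(fs)
    return {"f0": 1, "sp": sp_dims, "ap": ap, "total": 1 + sp_dims + ap}


if __name__ == "__main__":
    print(raw_frame_dims(48000))    # {'f0': 1, 'sp': 1025, 'ap': 1025, 'total': 2051}
    print(coded_frame_dims(48000))  # {'f0': 1, 'sp': 50, 'ap': 5, 'total': 56}
```

Under these assumptions, both 44.1 kHz and 48 kHz input give the 2051-dim raw vector, and coding SP to 50 dimensions gives the 56-dim vector recommended above.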