Closed r9y9 closed 7 years ago
Hi,
I fear I can't help with the issue itself... What I did was integrate the MGC2SP function from SPTK into my C++ code, using the same parameters as in https://github.com/CSTR-Edinburgh/merlin/blob/master/src/utils/generate.py#L277
This works well for me, except that it's really slow: slower than the DNN prediction, slower than MLPG, and slower than the vocoding.
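For reference, my understanding of what MGC2SP does for the usual Merlin setting (gamma = 0, i.e. plain mel-cepstrum) is: unwarp the mel-cepstrum with the all-pass constant -alpha, then take an FFT and exponentiate. Here is a minimal numpy sketch of that pipeline; the function names are mine, and this is an approximation of SPTK's behavior, not its actual implementation:

```python
import numpy as np

def freqt(c, order, alpha):
    """Frequency-warp a cepstrum by the all-pass constant alpha
    (the Oppenheim-Johnson recursion, as in SPTK's freqt)."""
    out = np.zeros(order + 1)
    for ci in reversed(c):
        d = np.empty(order + 1)
        d[0] = ci + alpha * out[0]
        d[1] = (1.0 - alpha * alpha) * out[0] + alpha * out[1]
        for m in range(2, order + 1):
            d[m] = out[m - 1] + alpha * (out[m] - d[m - 1])
        out = d
    return out

def mgc2sp_approx(mgc, alpha, fftlen):
    """Mel-cepstrum -> magnitude spectrum for gamma = 0:
    unwarp with -alpha, build a symmetric cepstrum buffer,
    FFT, exponentiate the (real) log spectrum."""
    half = fftlen // 2
    c = freqt(mgc, half, -alpha)
    buf = np.zeros(fftlen)
    buf[:half + 1] = c
    buf[half + 1:] = c[1:half][::-1]  # cepstrum of a real spectrum is symmetric
    logsp = np.fft.rfft(buf).real
    return np.exp(logsp)
```

With alpha = 0 the warping reduces to the identity, which is a handy sanity check.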
Did you look into the performance issue too?
Oh, and in the feature extraction Merlin also uses SPTK for the conversion: https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/vocoder/world/extract_features_for_merlin.sh#L65 In case this is interesting for you ;).
No, I haven't looked into performance issues yet. I'm surprised that it could be slower than the DNN predictions.
Here's an example (prediction done using Eigen with SSE2 enabled):
Synthesizing "This is a"
[DNN-PROFILER] Prediction time for single phone: 14ms.
[DNN-PROFILER] Prediction time for single phone: 23ms.
[DNN-PROFILER] Prediction time for single phone: 15ms.
[DNN-PROFILER] Prediction time for single phone: 14ms.
[DNN-PROFILER] Prediction time for single phone: 19ms.
[DNN-PROFILER] Prediction time for single phone: 31ms.
[DNN-PROFILER] Time for BAP MLPG: 11ms.
[DNN-PROFILER] Time for LF0 MLPG: 3ms.
[DNN-PROFILER] Time for MGC MLPG: 105ms.
[DNN-PROFILER] Time for acoustic feature transformation: 224ms.
[DNN-PROFILER] Time for vocoding: 59ms.
The predictions sum up to 116ms. The "acoustic feature transformation" also includes LF0 and BAP, but those are cheap. Sorry for hijacking the thread, but this is why I'm also interested in a different implementation than the one in SPTK ;).
EDIT: That's a feedforward NN with 6 layers of 512 units.
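One plausible reason the transformation dominates: the frequency warping is applied per frame, coefficient by coefficient. Since the warping is linear in the cepstrum, it can be precomputed once per utterance as a matrix and applied to all frames with one matmul, followed by one batched FFT. A rough numpy sketch of that idea (gamma = 0 assumed; the names are mine, not SPTK's):

```python
import numpy as np

def freqt_matrix(in_order, out_order, alpha):
    """The frequency warping is linear, so applying it to the columns
    of an identity matrix yields the full transform as a matrix."""
    def freqt(c, order, a):
        out = np.zeros(order + 1)
        for ci in reversed(c):
            d = np.empty(order + 1)
            d[0] = ci + a * out[0]
            d[1] = (1.0 - a * a) * out[0] + a * out[1]
            for m in range(2, order + 1):
                d[m] = out[m - 1] + a * (out[m] - d[m - 1])
            out = d
        return out
    eye = np.eye(in_order + 1)
    return np.stack([freqt(eye[i], out_order, alpha)
                     for i in range(in_order + 1)], axis=1)

def mgc2sp_frames(mgc_frames, alpha, fftlen):
    """Convert all frames of an utterance at once: one matmul to
    unwarp, one batched FFT, one exponentiation."""
    half = fftlen // 2
    A = freqt_matrix(mgc_frames.shape[1] - 1, half, -alpha)  # reused across frames
    c = mgc_frames @ A.T                       # (T, half + 1)
    buf = np.zeros((len(c), fftlen))
    buf[:, :half + 1] = c
    buf[:, half + 1:] = c[:, 1:half][:, ::-1]  # symmetric cepstrum
    return np.exp(np.fft.rfft(buf, axis=1).real)
```

Whether this actually beats SPTK's per-frame code would need profiling, but it removes the per-frame recursion from the inner loop.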
I apologize for the late reply.
The change of voice timbre when using the WORLD codec was a bug. The problem was caused by using speech with a sampling frequency below 40 kHz.
I have fixed this problem, so please check the new program.
Thank you very much! The issue seems to be fixed now.
I tried to reconstruct the spectral envelope from the coded one using WORLD and noticed that it results in perceptually degraded speech, even when no modification is applied to the coded spectral envelope. In my opinion, the generated speech loses speaker identity. When I used the SPTK-based mel-cepstrum representation and its inverse transform, there was no such degradation, so I'd expect the same when using WORLD. I tried a few speech examples (male/female), and it seems that the problem is not speaker- or gender-specific.
To illustrate the problem, I created a notebook that investigates the difference. You can listen to the generated audio examples at: http://nbviewer.jupyter.org/gist/r9y9/ca05349097b2a3926ec77a02e62c6632
Some links:
Before investigating the problem further (e.g. trying a broader set of examples), could you (or anybody) tell me whether I'm doing this correctly? The code in the notebook is written in Python (it needs https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder/pull/8 and the master branch of pysptk), but I have C++ code as well to reproduce the same issue locally. Let me know if you need more code / information.
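To put a number on the perceived degradation alongside the listening examples, one common objective check is the log-spectral distortion between the original and round-trip-reconstructed envelopes. A minimal numpy sketch (the helper name is mine):

```python
import numpy as np

def log_spectral_distortion(sp_ref, sp_est, eps=1e-12):
    """Mean RMS log-spectral distortion in dB between two sets of
    magnitude-spectrum frames, each of shape (T, fftlen // 2 + 1)."""
    diff = 20.0 * np.log10((sp_ref + eps) / (sp_est + eps))
    # RMS over frequency per frame, then mean over frames
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))
```

Comparing this value for the SPTK round trip vs. the WORLD round trip on the same utterance should show whether the degradation is measurable in the envelope itself or only perceptual.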