Code/decode spectral envelope causes more perceptual degradation than SPTK-based mel-cepstrum parametrization and its inverse transform

r9y9 commented 7 years ago

I tried to reconstruct the spectral envelope from its coded form using WORLD and noticed that this results in perceptually degraded speech, even when the coded spectral envelope is not modified at all. In my opinion, the generated speech loses speaker identity. When I used the SPTK-based mel-cepstrum representation and its inverse transform, there was no such degradation, so I'd expect the same to hold when using WORLD. I tried a few speech examples (male/female) and the problem does not seem to be speaker- or gender-specific.

To illustrate the problem, I created a notebook that investigates the difference. You can listen to the generated audio examples at: http://nbviewer.jupyter.org/gist/r9y9/ca05349097b2a3926ec77a02e62c6632

Some links:

Synthesized audio by WORLD code/decode spectral envelope: http://nbviewer.jupyter.org/gist/r9y9/ca05349097b2a3926ec77a02e62c6632#1.-Synthesis-from-coded-spectral-envelope-by-WORLD
Synthesized audio by SPTK-based mcep and its inverse transform: http://nbviewer.jupyter.org/gist/r9y9/ca05349097b2a3926ec77a02e62c6632#2.-Synthesis-from-mel-cepstrum-using-pysptk

Before investigating the problem further (e.g. trying a broader set of examples), could you (or anybody) tell me whether I'm doing this correctly? The code in the notebook is written in Python (it needs https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder/pull/8 and the master branch of pysptk), but I also have C++ code that reproduces the same issue locally. Let me know if you need more code or information.

m-toman commented 7 years ago

Hi,

I fear I can't help with the issue itself... What I did was integrate the MGC2SP function from SPTK into my C++ code, using the same parameters as in https://github.com/CSTR-Edinburgh/merlin/blob/master/src/utils/generate.py#L277

This works well for me, except that it's really slow. It is actually slower than the DNN prediction, slower than MLPG and slower than the vocoding.

Did you look into the performance issue too?

m-toman commented 7 years ago

Oh, merlin also uses SPTK for the conversion during feature extraction: https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/vocoder/world/extract_features_for_merlin.sh#L65 In case this is interesting for you ;).

r9y9 commented 7 years ago

No, I haven't looked into performance issues yet. I'm surprised that it could be slower than the DNN predictions.

m-toman commented 7 years ago

Here is an example (prediction done using Eigen with SSE2 enabled):

Synthesizing "This is a"
[DNN-PROFILER] Prediction time for single phone: 14ms.
[DNN-PROFILER] Prediction time for single phone: 23ms.
[DNN-PROFILER] Prediction time for single phone: 15ms.
[DNN-PROFILER] Prediction time for single phone: 14ms.
[DNN-PROFILER] Prediction time for single phone: 19ms.
[DNN-PROFILER] Prediction time for single phone: 31ms.
[DNN-PROFILER] Time for BAP MLPG: 11ms.
[DNN-PROFILER] Time for LF0 MLPG: 3ms.
[DNN-PROFILER] Time for MGC MLPG: 105ms.
[DNN-PROFILER] Time for acoustic feature transformation: 224ms.
[DNN-PROFILER] Time for vocoding: 59ms.

The predictions sum up to roughly 116ms. "Acoustic feature transformation" also covers LF0 and BAP, but those are cheap, so most of its 224ms goes to the MGC conversion. Sorry for hijacking the thread, but this is why I'm also interested in an implementation other than the one in SPTK ;).

EDIT: That's a 6x512-unit feedforward NN.
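
To make the comparison between stages easier to read, the figures in the log above can be tallied per stage. The following is only a small sketch using the Python standard library; the regular expression and the `tally` helper are illustrative and assume the log format exactly as printed above.

```python
import re
from collections import defaultdict

# Profiler output copied from the log above.
LOG = """\
[DNN-PROFILER] Prediction time for single phone: 14ms.
[DNN-PROFILER] Prediction time for single phone: 23ms.
[DNN-PROFILER] Prediction time for single phone: 15ms.
[DNN-PROFILER] Prediction time for single phone: 14ms.
[DNN-PROFILER] Prediction time for single phone: 19ms.
[DNN-PROFILER] Prediction time for single phone: 31ms.
[DNN-PROFILER] Time for BAP MLPG: 11ms.
[DNN-PROFILER] Time for LF0 MLPG: 3ms.
[DNN-PROFILER] Time for MGC MLPG: 105ms.
[DNN-PROFILER] Time for acoustic feature transformation: 224ms.
[DNN-PROFILER] Time for vocoding: 59ms.
"""

LINE_RE = re.compile(r"\[DNN-PROFILER\] (?P<label>.+?): (?P<ms>\d+)ms\.")


def tally(log):
    """Sum the reported milliseconds per stage label."""
    totals = defaultdict(int)
    for match in LINE_RE.finditer(log):
        totals[match.group("label")] += int(match.group("ms"))
    return dict(totals)


totals = tally(LOG)
grand_total = sum(totals.values())

# Print the stages from most to least expensive, with their share of the total.
for label, ms in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{label}: {ms} ms ({100 * ms / grand_total:.1f}%)")
```

Summing the six per-phone predictions this way makes it easy to see that the acoustic feature transformation alone takes about twice as long as all DNN predictions combined.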

mmorise commented 7 years ago

I apologize for the late reply.

The change in voice timbre when using the WORLD codec was a bug. The problem was caused by using speech with a sampling frequency (fs) under 40 kHz.

I have fixed this problem, so please check the new version.

r9y9 commented 7 years ago

Thank you very much! The issue seems to be fixed now.

mmorise / World

Code/decode spectral envelope causes more perceptual degradation than SPTK-based mel-cepstrum parametrization and its inverse transform #33