mel spectrum misconception

mmorise / World

A high-quality speech analysis, manipulation and synthesis system

http://www.kisc.meiji.ac.jp/~mmorise/world/english

Other


mel spectrum misconception #90

Closed Dannynis closed 5 years ago

Dannynis commented 5 years ago

Dear mmorise, in the frame encoding function it seems that you used simple linear interpolation (via a MATLAB function) to get the mel spectrum. My understanding is that a mel spectrum requires applying triangular filters to the STFT of the signal instead. Am I wrong?

mmorise commented 5 years ago

Triangular filters are used to calculate MFCCs (mel-frequency cepstral coefficients). However, speech synthesis generally requires the mel-cepstrum rather than MFCCs, and the mel-cepstrum does not require triangular filters. Strictly speaking, my implementation differs from the traditional mel-cepstrum implementation, but the sound quality is almost the same.
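For readers unfamiliar with the triangular-filter step that MFCC extraction uses, here is a minimal NumPy sketch. It is not code from WORLD. The function names, the filter count of 24, the random test frame, and the mel formula `2595 * log10(1 + f / 700)` are all assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def triangular_mel_filterbank(n_filters, fft_size, fs):
    """Return an (n_filters, fft_size // 2 + 1) matrix of triangular weights."""
    n_bins = fft_size // 2 + 1
    freqs = np.linspace(0.0, fs / 2.0, n_bins)
    # n_filters + 2 edge frequencies, equally spaced on the mel axis.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    fb = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        # Triangle: rises from lo to mid, falls from mid to hi, zero elsewhere.
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

# Hypothetical input: one Hann-windowed frame of noise.
fs, fft_size = 16000, 1024
rng = np.random.default_rng(0)
frame = rng.standard_normal(fft_size) * np.hanning(fft_size)
power = np.abs(np.fft.rfft(frame)) ** 2

fb = triangular_mel_filterbank(24, fft_size, fs)
log_mel_energies = np.log(fb @ power + 1e-12)  # small floor avoids log(0)
```

An MFCC pipeline would typically apply a DCT to `log_mel_energies` next. The filters matter there because the raw STFT power spectrum is noisy and carries harmonic structure, which the triangles average out.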

Sleepwalking commented 5 years ago

On a side note: traditionally, MFCCs and mel-cepstra are defined for the spectrum, not the spectral envelope. WORLD's coder works on spectral envelopes, which are already smooth, so there is no need to smooth them again with triangular filters or a cepstral lowpass. In that case, result-wise, simple spectral downsampling is not too different from the traditional way of mel-cepstral analysis (which is a lot more complicated).
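The "simple spectral downsampling" idea can be sketched as resampling the log envelope at points equally spaced on a mel axis, then interpolating back. This is only an illustration of the principle, not WORLD's actual coder. The function names, the synthetic envelope, the mel formula, and the choice of 60 points are all assumptions.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def downsample_envelope_on_mel_axis(envelope, fs, n_points):
    """Sample log(envelope) at n_points equally spaced on the mel axis."""
    freqs = np.linspace(0.0, fs / 2.0, len(envelope))
    mel_grid = np.linspace(0.0, hz_to_mel(fs / 2.0), n_points)
    return np.interp(mel_to_hz(mel_grid), freqs, np.log(envelope))

def upsample_envelope_from_mel_axis(coded, fs, n_bins):
    """Linearly interpolate the coded log envelope back onto n_bins linear bins."""
    freqs = np.linspace(0.0, fs / 2.0, n_bins)
    mel_grid = np.linspace(0.0, hz_to_mel(fs / 2.0), len(coded))
    return np.exp(np.interp(hz_to_mel(freqs), mel_grid, coded))

# Hypothetical smooth envelope: a decaying slope plus one formant-like bump.
fs, n_bins = 16000, 513
freqs = np.linspace(0.0, fs / 2.0, n_bins)
envelope = np.exp(-freqs / 4000.0) + 0.5 * np.exp(-((freqs - 1500.0) / 600.0) ** 2) + 0.01

coded = downsample_envelope_on_mel_axis(envelope, fs, 60)
reconstructed = upsample_envelope_from_mel_axis(coded, fs, n_bins)
max_log_error = np.max(np.abs(np.log(reconstructed) - np.log(envelope)))
```

Because the input is already smooth, plain interpolation loses little here. Applied to a raw STFT magnitude, the same resampling would sample individual harmonics or noise peaks instead of the overall shape.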

Dannynis commented 5 years ago

Thank you for your response! My follow-up question is: what definition do you use for the mel-cepstrum? Is it defined "as the inverse Fourier transform of the generalized logarithmic spectrum calculated on a warped frequency scale", as mentioned in the paper in Sleepwalking's comment? Also, what is the difference between the spectrogram that CheapTrick outputs and the output of the algorithm discussed here? Finally, when I encode the spectral envelope and immediately apply the inverse process (decode_spectral_envelope), I get slightly corrupted audio. That makes sense given the information lost when taking the bins, but why doesn't taking more bins increase the audio quality?

Sleepwalking commented 5 years ago

I believe that in Tokuda's paper the mel-cepstrum is defined as an approximation (on a warped frequency axis) of the power spectral density, with a limited cepstral order, that minimizes the loss of spectral detail according to the Itakura-Saito distance. I don't think WORLD has such an analytical definition. The output from WORLD doesn't carry a very precise mathematical meaning; it is more a characterization of the shape itself.
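For reference, the Itakura-Saito distance between a power spectrum P and a model spectrum P_hat is commonly written as the average over frequency of P/P_hat - log(P/P_hat) - 1. The sketch below is a discrete version of that standard formula. It is not taken from Tokuda's paper or from WORLD, and the function name and toy spectra are hypothetical.

```python
import numpy as np

def itakura_saito_distance(p, p_hat):
    """Mean over frequency bins of P/P_hat - log(P/P_hat) - 1 (strictly positive inputs)."""
    ratio = np.asarray(p, dtype=float) / np.asarray(p_hat, dtype=float)
    return float(np.mean(ratio - np.log(ratio) - 1.0))

# Toy spectra: the measure is zero for identical inputs and is not symmetric.
p = np.array([1.0, 2.0, 4.0, 8.0])
d_same = itakura_saito_distance(p, p)
d_model_too_large = itakura_saito_distance(p, 2.0 * p)
d_model_too_small = itakura_saito_distance(2.0 * p, p)
```

The asymmetry means a model that undershoots the spectrum is penalized more than one that overshoots by the same factor, so a fit under this criterion tends to follow spectral peaks more closely than valleys.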