Closed Dannynis closed 5 years ago
Triangular filters are used to calculate MFCCs (mel-frequency cepstral coefficients). However, speech synthesis generally requires the mel-cepstrum rather than MFCCs, and the mel-cepstrum does not require triangular filters. Strictly speaking, my implementation differs from the traditional mel-cepstrum implementation, but the sound quality is almost the same.
On a side note: traditionally, MFCCs and mel-cepstra are defined on the spectrum, not the spectral envelope. WORLD's coder works on spectral envelopes (already smooth), so there's no need to smooth things out again with triangular filters or a cepstral lowpass. In that case, result-wise, simple spectral downsampling is not too different from the traditional way of doing mel-cepstral analysis (which is a lot more complicated).
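To make the "simple spectral downsampling" idea concrete, here is a minimal sketch in Python/NumPy. It is not WORLD's actual code: the function names, the 2595/700 mel formula, and the choice to interpolate the log envelope are all assumptions made for illustration.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Common mel formula; assumed here, WORLD may use different constants.
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel above.
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def downsample_envelope_mel(envelope, fs, n_bins):
    """Sample a smooth spectral envelope at n_bins points equally spaced in mel.

    envelope: positive values on a linear grid from 0 Hz to fs/2 (one frame).
    Returns the log envelope at the mel-spaced frequencies (linear interpolation).
    """
    envelope = np.asarray(envelope, dtype=float)
    freqs_hz = np.linspace(0.0, fs / 2.0, len(envelope))
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_bins)
    target_hz = mel_to_hz(mel_points)
    # No triangular filters: the envelope is assumed to be smooth already,
    # so plain interpolation is used instead of averaging over a band.
    return np.interp(target_hz, freqs_hz, np.log(envelope))
```

For example, `downsample_envelope_mel(sp_frame, 16000, 60)` would reduce one envelope frame (hypothetical `sp_frame`) to 60 mel-spaced log-amplitude values.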
Thank you for your response! My follow-up question is: what definition do you use for the mel-cepstrum? Is it "the inverse Fourier transform of the generalized logarithmic spectrum calculated on a warped frequency scale", as mentioned in the paper in Sleepwalking's comment? Also, what's the difference between the spectrogram that CheapTrick outputs and what the discussed algorithm outputs? And when I encode the spectral envelope and immediately run the inverse process (decode_spectral_envelope), I get slightly corrupted audio, which makes sense given the information lost when taking the bins. But why doesn't taking more bins increase the audio quality?
I believe that in Tokuda’s paper the mel-cepstrum is defined as an approximation (on a warped frequency axis) of the power spectral density, with a limited cepstrum order, that minimizes the loss of spectral detail according to the Itakura-Saito distance. I don’t think WORLD has such an analytical definition. The output from WORLD doesn’t carry a very precise mathematical meaning; it is more a characterization of the shape itself.
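For reference, the Itakura-Saito distance mentioned above can be sketched as follows. This is only the standard discrete form of the distance between two power spectra, not Tokuda's full mel-cepstral analysis procedure; the function name and the use of a mean over bins are assumptions for illustration.

```python
import numpy as np

def itakura_saito_distance(p, p_hat):
    """Discrete Itakura-Saito distance between two positive power spectra.

    p:     reference power spectrum
    p_hat: approximating power spectrum (same length)
    The distance is zero only when the two spectra are identical,
    and it is not symmetric in its arguments.
    """
    p = np.asarray(p, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    ratio = p / p_hat
    return float(np.mean(ratio - np.log(ratio) - 1.0))
```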
Dear mmorise, in the frame encoding function it seems that you use simple linear interpolation (via a MATLAB function) to get the mel spectrum. My understanding is that a mel spectrum would instead require applying triangular filters to the STFT of the signal. Am I wrong?
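For comparison, the triangular-filter approach this question refers to can be sketched like this in Python/NumPy. It is a generic textbook-style filterbank, not code from WORLD or from any particular library; the function name, parameters, and the 2595/700 mel formula are assumptions for illustration.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Build triangular mel filters for a one-sided spectrum of n_fft//2 + 1 bins."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, fs / 2.0, n_bins)
    # n_filters triangles need n_filters + 2 edge frequencies, equally spaced in mel.
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2))

    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        left, center, right = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # Triangle: rises from left to center, falls from center to right, zero elsewhere.
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank
```

A mel spectrum for one STFT frame would then be `mel_filterbank(24, 1024, 16000) @ power_spectrum`, where `power_spectrum` is a hypothetical array of 513 power values. Each filter averages over a band, which smooths a raw spectrum.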