unilight closed this issue 5 years ago
Hi @unilight. Thank you for your question! You can listen to the samples here. As you may notice when listening, the wnv samples are almost the same as raw speech. In the subjective evaluation, the STRAIGHT samples sound clearly different from the wnv samples, so subjects tend to give them low scores. Furthermore, because we want to compare performance as a vocoder, the feature extraction settings are the same for both STRAIGHT and the WaveNet vocoder (5 ms shift, 24th-order mcep). This degrades the performance of STRAIGHT. If we used a shorter shift size and the full spectrum for STRAIGHT, its performance would improve.
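To give a rough sense of what that shared setting means, here is a minimal sketch of the arithmetic. The 16 kHz sampling rate and the FFT size of 1024 are assumptions for illustration only (neither is stated in this thread), and the helper names are hypothetical. It also assumes the 0th coefficient is counted, so a 24th-order mcep has 25 values per frame.

```python
# Sketch only: the sampling rate and FFT size below are assumed values,
# not settings confirmed in this thread.
SAMPLE_RATE_HZ = 16000  # assumption
FFT_SIZE = 1024         # assumption
SHIFT_MS = 5.0          # from the thread: 5 ms shift
MCEP_ORDER = 24         # from the thread: 24th-order mcep


def hop_size_samples(sample_rate_hz: int, shift_ms: float) -> int:
    """Number of samples between consecutive analysis frames."""
    return int(round(sample_rate_hz * shift_ms / 1000.0))


def full_spectrum_bins(fft_size: int) -> int:
    """Number of bins in a one-sided spectral envelope."""
    return fft_size // 2 + 1


def mcep_dims(order: int) -> int:
    """Coefficients per frame, counting the 0th coefficient."""
    return order + 1


hop = hop_size_samples(SAMPLE_RATE_HZ, SHIFT_MS)
bins = full_spectrum_bins(FFT_SIZE)
dims = mcep_dims(MCEP_ORDER)

print(f"hop size: {hop} samples per frame")
print(f"full spectrum: {bins} bins per frame")
print(f"mcep: {dims} coefficients per frame")
```

Under these assumed values, each frame of the spectral envelope is reduced from several hundred bins to 25 coefficients, which is the kind of compression that a full-spectrum STRAIGHT setup would avoid.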
If I understand correctly, the vocoders on the MOS chart were evaluated under the condition that the input to each vocoder was features extracted by STRAIGHT and the output was a raw waveform. If so, why did STRAIGHT get such a low score? Shouldn't it score as high as the raw waveform does?