Closed. QEDan closed this issue 4 years ago.
I've confirmed this issue in this library using the test.cpp file and setting the `fft_size` there. While I was in there, I saw the "Important notice (2017/01/02)" in the comments about changing `fft_size` and how it impacts other parameters such as `f0_floor`. It isn't clear exactly how these parameters affect one another in the code, so I would appreciate any guidance on working with small `fft_size`s and large `frame_period`s.
Hi, the recommended `fft_size` is a large number such as 2048; it has to be at least larger than `frame_period * fs / 1000` (i.e., the frame period in sample points). In the two combinations you presented, `fft_size` is far smaller than that threshold, which may have violated some of the assumptions in the WORLD vocoder.
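That constraint can be sketched as a quick pre-check. This is only an illustrative helper (`min_fft_size` is my own name, not part of WORLD's API; the library itself provides `GetFFTSizeForCheapTrick`, which derives a size from `fs` and `f0_floor`):

```python
def min_fft_size(frame_period_ms, fs):
    """Smallest power-of-two FFT size covering one analysis frame.

    Rule of thumb from the discussion above: fft_size must exceed
    frame_period * fs / 1000 (the frame period in samples).
    """
    frame_length = frame_period_ms * fs / 1000.0
    fft_size = 1
    while fft_size <= frame_length:
        fft_size *= 2
    return fft_size

# With fs=16000 and the default 5 ms frame period, one frame is
# 80 samples, so fft_size=64 is already too small:
print(min_fft_size(5, 16000))   # -> 128
print(min_fft_size(20, 16000))  # -> 512 (20 ms -> 320 samples)
```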
Could you explain why you set `fft_size` as small as 64 when `fs=16000`? As your setting is not recommended for high-quality speech analysis/reconstruction, I would like to hear the reason so that we can help.
Hi. We are using the vocoder in deep learning, and we want to keep the size of the acoustic parameters small. For us, there is a balance between high-quality synthesis and a lower-dimensional space that is easier for our models to learn. I do not necessarily need to go to `fft_size=64`, but this was the number where I could demonstrate the problem reliably with a simple script. I have more intermittent memory problems at `fft_size=256` or even `fft_size=512`. Usually things work reliably at `fft_size=1024` or larger.
I am building on the work done in this paper: https://arxiv.org/pdf/1904.01537.pdf

The authors used WORLD via Merlin for their research. They don't share the details of which parameters they used for WORLD, but they do share the dimensionality of the encoding: "There are 60 features for spectral envelope, 5 for band aperiodicity, 1 for F0 and a boolean flag for the voiced/unvoiced decision." The defaults for WORLD tend to produce much larger parameter spaces than what these authors were using.
I have reached out to them to get more details and am still hoping to hear back.
I appreciate your help. Please let me know if you can offer any insights or guidance.
Thanks for the clarification.
Conventionally, low-dimensional spectral features are computed from high-dimensional spectrograms. One of the most commonly adopted low-dimensional spectral features is the log Mel spectrogram (as used in the paper you mentioned), which is extracted from high-dimensional features (such as WORLD's `sp`). Thus, you can extract `sp` and then apply Mel warping to reduce the dimension to 40-120. Many DSP packages (librosa, TensorFlow, etc.) provide linear-spectrogram to Mel-spectrogram conversion.
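As a minimal numpy sketch of that Mel warping, applied to a WORLD-style `sp` matrix (random data stands in for a real spectral envelope here; in practice you would use something like `librosa.filters.mel`, or WORLD's own codec, rather than this hand-rolled filterbank):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, fft_size, fs):
    """Triangular Mel filters mapping (fft_size//2 + 1) linear bins to n_mels bands."""
    n_bins = fft_size // 2 + 1
    # Band edges equally spaced on the Mel scale from 0 Hz to Nyquist.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2))
    bin_freqs = np.linspace(0.0, fs / 2.0, n_bins)
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (bin_freqs - lo) / (center - lo)      # rising slope of triangle i
        down = (hi - bin_freqs) / (hi - center)    # falling slope of triangle i
        fb[i] = np.maximum(0.0, np.minimum(up, down))
    return fb

# sp from CheapTrick has shape (n_frames, fft_size//2 + 1); warp to 60 dims.
fft_size, fs = 1024, 16000
sp = np.abs(np.random.randn(100, fft_size // 2 + 1)) + 1e-8  # stand-in for WORLD sp
fb = mel_filterbank(60, fft_size, fs)
log_mel_sp = np.log(sp @ fb.T + 1e-10)
print(log_mel_sp.shape)  # (100, 60)
```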
Similarly, `ap` is usually replaced with band aperiodicity (`bap`), with a dimension of 5 or so.
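To make the idea concrete, here is a toy sketch that averages log-aperiodicity inside a few frequency bands. This is my own crude illustration, not WORLD's actual `CodeAperiodicity` codec (which uses fixed band boundaries):

```python
import numpy as np

def crude_band_aperiodicity(ap, n_bands=5):
    """Toy band aperiodicity: mean log-ap within equal-width frequency bands.

    NOTE: illustration only; WORLD ships its own aperiodicity codec with
    different (fixed) band boundaries.
    """
    n_frames, n_bins = ap.shape
    edges = np.linspace(0, n_bins, n_bands + 1).astype(int)
    bands = [np.log(ap[:, lo:hi] + 1e-10).mean(axis=1)
             for lo, hi in zip(edges[:-1], edges[1:])]
    return np.stack(bands, axis=1)

ap = np.random.rand(100, 513)  # stand-in for WORLD ap, shape (n_frames, fft_size//2 + 1)
bap = crude_band_aperiodicity(ap)
print(bap.shape)  # (100, 5)
```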
It's also common to train a dimension-reduction mapping using neural networks (simply feed in the spectrograms and apply a convolutional stack along the frequency axis).
The rest is left to you to explore.
I am using the Python wrapper (https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder). I posted an issue in that repo as well, but I think the issue may be in this C++ library. At least, I would be interested to hear thoughts about it.
I get segmentation faults when using non-default combinations of `frame_period` and `fft_size`. For example, the following script results in a segmentation fault:

[script omitted]

I get the same problem whether using `np.zeros()`, real audio files, or random noise, so I don't think it's related to the input audio at all. Replacing `fft_size` with `128` or larger tends to resolve the segmentation fault. The following parameter combinations (with the code above) result in an error:

[error output omitted]

The backtraces from gdb generally show the problems occurring in the `synthesize` method of the C++ library.
Thank you for your help.