mmorise / World

A high-quality speech analysis, manipulation and synthesis system
http://www.kisc.meiji.ac.jp/~mmorise/world/english

Memory issues in Synthesize #84

Closed QEDan closed 4 years ago

QEDan commented 5 years ago

I am using the Python wrapper (https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder). I posted an issue in that repo as well, but I think the problem may be in this C++ library. In any case, I would be interested to hear your thoughts on it.

I get segmentation faults when using non-default combinations of frame_period and fft_size. For example, the script below aborts with:

double free or corruption (!prev)
Program received signal SIGABRT, Aborted.

```python
import numpy as np
import pyworld as pw

frame_period = 5.0
fft_size = 64
fs = 16000
x = np.zeros(fs * 3)
f0, sp, ap = pw.wav2world(x, fs,
                          fft_size=fft_size,
                          frame_period=frame_period)
y = pw.synthesize(f0, sp, ap, fs, frame_period=frame_period)
```

I get the same problem whether using np.zeros(), real audio files, or random noise, so I don't think it's related to the input audio at all.

Increasing fft_size to 128 or larger tends to resolve the segmentation fault.

The following parameter combination (with the code above) results in an error of

free(): invalid size

frame_period = 5.0
fft_size = 128
fs = 32000

Backtraces from gdb generally show the problem occurring in the synthesize method of the C++ library.

Thank you for your help.

QEDan commented 5 years ago

I've confirmed this issue in this library itself by using the test.cpp file and setting fft_size there. While I was in there, I saw the "Important notice (2017/01/02)" comment about changing fft_size and how it interacts with other parameters such as f0_floor. It isn't clear exactly how these parameters constrain one another in the code, so I would appreciate any guidance on working with small fft_size and large frame_period values.

JeremyCCHsu commented 5 years ago

Hi, the recommended fft_size is a large number such as 2048; it must be at least as large as frame_period * fs / 1000 (i.e., the frame period expressed in sample points). In both of the combinations you presented, fft_size is well below that threshold, which likely violates assumptions inside the WORLD vocoder.
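As a quick sanity check (my own sketch, not part of WORLD's API), you can verify the bound before calling synthesize. Both crashing combinations reported above violate it, and the fft_size=128 / fs=16000 setting that was reported to work satisfies it:

```python
def frame_len_samples(frame_period_ms: float, fs: int) -> int:
    """Frame period expressed in sample points."""
    return int(frame_period_ms * fs / 1000.0)

def fft_size_ok(fft_size: int, frame_period_ms: float, fs: int) -> bool:
    # WORLD expects fft_size to be at least the frame period in samples.
    return fft_size >= frame_len_samples(frame_period_ms, fs)

print(fft_size_ok(64, 5.0, 16000))   # 64 < 80   -> False (segfault case)
print(fft_size_ok(128, 5.0, 32000))  # 128 < 160 -> False (invalid-size case)
print(fft_size_ok(128, 5.0, 16000))  # 128 >= 80 -> True  (reported working)
```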

Could you explain why you set fft_size as small as 64 when fs=16000? Since this setting is not recommended for high-quality speech analysis/reconstruction, I would like to understand the motivation so that we can help.

QEDan commented 5 years ago

Hi. We are using the vocoder for deep learning, and we want to keep the acoustic parameterization small. For us there is a trade-off between high-quality synthesis and a lower-dimensional space that is easier for our models to learn. I do not necessarily need fft_size=64, but that was the value at which I could reproduce the problem reliably with a simple script. I see intermittent memory problems even at fft_size=256 or fft_size=512; things usually work reliably at fft_size=1024 or larger.

I am building on the work done in this paper: https://arxiv.org/pdf/1904.01537.pdf

The authors used WORLD via Merlin for their research. They don't share the exact WORLD parameters they used, but they do give the dimensionality of the encoding: "There are 60 features for spectral envelope, 5 for band aperiodicity, 1 for F0 and a boolean flag for the voiced/unvoiced decision." WORLD's defaults tend to produce much larger parameter spaces than what these authors were using.

I have reached out to them to get more details and am still hoping to hear back.

I appreciate your help. Please let me know if you can offer any insights or guidance.

JeremyCCHsu commented 5 years ago

Thanks for the clarification.

Conventionally, low-dimensional spectral features are computed from high-dimensional spectrograms. One of the most commonly adopted low-dimensional spectral features is the log Mel spectrogram (as used in the paper you mentioned), which is extracted from a high-dimensional representation such as WORLD's sp. So you can extract sp and then apply Mel warping to reduce the dimension to 40 - 120.

Many DSP packages (librosa, TensorFlow, etc.) provide linear-spectrogram to Mel-spectrogram conversion.
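To illustrate the idea, here is a self-contained numpy sketch of Mel warping applied to a WORLD-shaped spectral envelope. In practice you would use something like librosa.filters.mel; the triangular filterbank below is a simplified textbook construction, and the sp array is a random stand-in for WORLD's actual output:

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(fs, fft_size, n_mels):
    # Triangular filters spaced evenly on the Mel scale from 0 Hz to Nyquist.
    n_bins = fft_size // 2 + 1
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2)
    bin_pts = np.floor((fft_size + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        lo, center, hi = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        for k in range(lo, center):
            fb[m - 1, k] = (k - lo) / max(center - lo, 1)
        for k in range(center, hi):
            fb[m - 1, k] = (hi - k) / max(hi - center, 1)
    return fb

fs, fft_size, n_mels = 16000, 1024, 60
sp = np.abs(np.random.randn(200, fft_size // 2 + 1))  # stand-in for WORLD sp
fb = mel_filterbank(fs, fft_size, n_mels)
mel_sp = np.log(sp @ fb.T + 1e-8)  # frames x 60, matching the paper's size
print(mel_sp.shape)
```

This reduces a 513-dimensional sp frame to the 60 spectral features used in the paper.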

Similarly, ap is usually replaced with band aperiodicity (bap), with a dimension of 5 or so.
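The rough idea is to summarize the per-bin aperiodicity within a handful of frequency bands. A minimal numpy sketch follows; note the equally spaced bands here are my own simplification for illustration, not WORLD's actual bap coding (pyworld exposes its own helpers for this, e.g. code_aperiodicity):

```python
import numpy as np

def band_aperiodicity(ap, n_bands=5):
    # Collapse per-bin aperiodicity (frames x bins) to n_bands values per
    # frame by averaging within equally spaced frequency bands.
    # NOTE: simplified illustration, not WORLD's actual bap coding scheme.
    n_frames, n_bins = ap.shape
    edges = np.linspace(0, n_bins, n_bands + 1).astype(int)
    return np.stack(
        [ap[:, edges[i]:edges[i + 1]].mean(axis=1) for i in range(n_bands)],
        axis=1)

ap = np.random.rand(200, 513)   # stand-in for WORLD ap output
bap = band_aperiodicity(ap)
print(bap.shape)                # 5 values per frame, as in the paper
```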

It's also common to train a dimension-reduction mapping with neural networks (simply feed in the spectrograms and apply a convolutional stack along the frequency axis).

The rest is left to you to explore.