mmorise / World

A high-quality speech analysis, manipulation and synthesis system
http://www.kisc.meiji.ac.jp/~mmorise/world/english

Realtime synthesis #47

Closed m-toman closed 5 years ago

m-toman commented 6 years ago

Hi,

could you tell me how the buffer_size in synthesisrealtime.cpp is meant to be used, and how the file is meant to be used in general? Is the idea that there are two threads, where one calls AddParameters() from time to time and the other runs Synthesis2()? Or that I run AddParameters(), then Synthesis2(), then AddParameters(), then Synthesis2(), and so on?

I'm especially confused by this loop: https://github.com/mmorise/World/blob/master/src/synthesisrealtime.cpp#L583. Does it loop until buffer_size is reached even if there are no parameters left?

What I want to achieve is to feed a sentence to the vocoder in multiple chunks: pass the parameters of one phone, get the resulting waveform (and forward it elsewhere), then the next phone, and so on. I first tried the regular synthesis.cpp, chopped the feature vectors into multiple chunks, and just ran Synthesis() for each chunk, which led to discontinuities between the chunks (like a popping noise).

Thanks a lot.

mmorise commented 6 years ago

Real-time synthesis involves some complex processing. If the following explanation misses your point, please correct my misunderstanding.

For example, assume the frame shift is 5 ms and two frames are added by AddParameters(). We can then generate a waveform with a duration of 5 ms (0...5 ms). If the buffer_size corresponds to 2 ms, Synthesis2() first generates the waveform (0...2 ms), and the parameter synth->synthesized_sample is set to 2 ms. After that, the function generates the waveform (2...4 ms), and synthesized_sample is set to 4 ms.

Some parameters are still left, but we cannot synthesize the waveform (4...6 ms) because we can only generate the waveform from 0 to 5 ms. The condition in line 583 checks whether the next buffer can be generated or not. (In reality, more complex parameters related to vocal cord vibration are used, but this is the main idea. The parameter current_location is determined from the temporal position of vocal cord vibration calculated from the added parameters.)

If the duration we can generate from the added parameters is insufficient, the condition does not become true. After adding one more frame with AddParameters(), we can generate the waveform (4...6 ms).

AddParameters() can add parameters consisting of many frames at one time. The second argument gives the number of frames to be added. Different values are used in the two examples.
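For illustration, the intended calling pattern looks roughly like this. This is only a sketch: the helper function name and argument values are illustrative, and the exact declarations should be checked against synthesisrealtime.h and test.cpp.

#include <vector>
#include "world/synthesisrealtime.h"  // WorldSynthesizer, AddParameters(), Synthesis2(), ...

// Sketch: feed one chunk of frames and pull out whatever waveform is ready so far.
void SynthesizeChunk(double *f0, int f0_length, double **spectrogram,
      double **aperiodicity, WorldSynthesizer *synthesizer,
      std::vector<double> *output) {
   AddParameters(f0, f0_length, spectrogram, aperiodicity, synthesizer);
   // Synthesis2() fills synthesizer->buffer and returns non-zero as long as the
   // frames added so far cover another buffer_size samples.
   while (Synthesis2(synthesizer) != 0) {
      for (int i = 0; i < synthesizer->buffer_size; ++i)
         output->push_back(synthesizer->buffer[i]);
   }
}

// Setup and teardown (values only for illustration):
//   WorldSynthesizer synthesizer = { 0 };
//   InitializeSynthesizer(48000, 5.0, 2048, 240, 10, &synthesizer);
//   ... call SynthesizeChunk() once per chunk of frames ...
//   DestroySynthesizer(&synthesizer);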

As for the popping noise, it is difficult to determine the cause. Synthesis2() may have a bug, or there may be a bug in your own implementation.

Best regards,

m-toman commented 6 years ago

Thank you. The popping issue did not occur when using your realtime synthesis code, but when calling the original synthesis.cpp multiple times and then concatenating the results.

OK, thank you. I first tried an arbitrary buffer size of 10000 while adding only a few feature frames. This still seemed to fill up the buffer with something, even though not enough frames were available.

So I tried a buffer size of 240 (assuming this is what is needed to hold the waveform for a single 5 ms feature frame at 48 kHz), but this segfaulted. As I understand it, it should work with 240, so I'll investigate further. UPDATE: I now suspect it's my fault.
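For reference, the arithmetic behind the 240-sample figure:

int samples_per_frame = 48000 * 5 / 1000;  // 48 kHz * 5 ms per frame = 240 samples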

Thanks again

m-toman commented 6 years ago

Hi again,

in the meantime I got the streaming synthesis to run. For a number of voices I found that there are "popping" noises at the chunk transitions.

I've attached two samples; perhaps you have an idea how to avoid that?

This is the regular version: https://drive.google.com/file/d/0B2D9egvk27W3ZEJwS19zaDBYWGs/view?usp=sharing

This is the streaming version: https://drive.google.com/file/d/0B2D9egvk27W3c1ZXZkhMVVViSkU/view?usp=sharing I can, for example, hear the popping in "identity".

The chunk borders should be at the following sample positions (currSample): 7920, 21120, 42480, 53520, 70080, 79680, 107760.

Thanks

mmorise commented 6 years ago

Thank you for your report.

I confirmed the popping noise. If possible, could you give me the speech parameters used to synthesize the attached samples? The temporal position of vocal cord vibration may be wrongly shifted.


This error was often observed in a previous version of WORLD, but I thought I had already fixed it. Even the latest version may still not work correctly in your example.

m-toman commented 6 years ago

Hi, thanks for the answer.

I've noticed that the issue occurs less with voices that are generally better, and I could alleviate it a bit by doing speaker adaptation from those voices.

I have some features here that I've used for testing recently: feats.zip They are CSV files, already split into chunks exactly as I provided them to the WORLD realtime synthesis. I hope they work out; otherwise I'll export new ones.

I've updated to the latest version of WORLD for my tests.

BTW, I'm using this piece of code to get the full aperiodicities from the 5 coarse bands: https://github.com/CSTR-Edinburgh/merlin/blob/master/tools/WORLD/test/synth.cpp#L348-L378

mmorise commented 6 years ago

Thank you very much. I will check it with your data, so please give me a few days.

mmorise commented 6 years ago

I have checked the provided features and tried to generate the waveform. Unfortunately, I cannot synthesize natural speech because I don't know the sampling frequency (fs). I have already tested several values (16, 22.05, 32, 44.1, 48 kHz), but I cannot get speech like the samples you provided. (The result seems to have a formant-shifted timbre.)

In the spectral envelope, many values in the higher frequency band are 0. I think there may be a problem with the number of significant digits.

It would be helpful if you could give me more information.

m-toman commented 6 years ago

Oh right, I'm sorry. It's 48 kHz, and it might be that it's a female voice.

My streaming code is currently still pretty messy, but here are the important parts:

InitializeSynthesizer(48000, 5.0, 2048, 240, 3, this->synth.get());

void DNNStreamingVocoder::Synthesize(int numframes, double* f0, double** sp, double** ap, TTSResultPtr& result) {
   static const double FRAMEPERIOD = 5.0;
   //TODO: arguments
   int fft_size = 2048;
   int fs = 48000;
   int number_of_aperiodicities = 5;
   int f0_length = numframes; 

   double** coarse_aperiodicities = ap;
   double** aperiodicity;

   aperiodicity = new double *[f0_length];
   for (int i = 0; i < f0_length; ++i) {
      aperiodicity[i] = new double[fft_size / 2 + 1];
   }

   // convert bandaps to full aperiodic spectrum by interpolation (originally in d4c extraction):
   // Linear interpolation to convert the coarse aperiodicity into its
   // spectral representation.

   // -- for interpolating --
   double* coarse_aperiodicity = new double[number_of_aperiodicities + 2];
   coarse_aperiodicity[0] = -60.0;
   coarse_aperiodicity[number_of_aperiodicities + 1] = 0.0;
   double* coarse_frequency_axis = new double[number_of_aperiodicities + 2];
   for (int i = 0; i <= number_of_aperiodicities; ++i)
      coarse_frequency_axis[i] =
         static_cast<double>(i) * world::kFrequencyInterval;
   coarse_frequency_axis[number_of_aperiodicities + 1] = fs / 2.0;

   double* frequency_axis = new double[fft_size / 2 + 1];
   for (int i = 0; i <= fft_size / 2; ++i) {
      frequency_axis[i] = static_cast<double>(i) * fs / fft_size;
   }
   // ----

   for (int i = 0; i < f0_length; ++i) {
      // load band ap values for this frame into  coarse_aperiodicity
      for (int k = 0; k < number_of_aperiodicities; ++k) {
         coarse_aperiodicity[k + 1] = coarse_aperiodicities[i][k];
      }
      interp1(coarse_frequency_axis, coarse_aperiodicity,
              number_of_aperiodicities + 2, frequency_axis, fft_size / 2 + 1, aperiodicity[i]);
      for (int j = 0; j <= fft_size / 2; ++j) {
         aperiodicity[i][j] = pow(10.0, aperiodicity[i][j] / 20.0);
      }
   }

   //---------------------------------------------------------------------------
   // Synthesis part
   //---------------------------------------------------------------------------
   // The length of the output waveform
   int y_length = static_cast<int>((f0_length - 1) *
                                   FRAMEPERIOD / 1000.0 * fs) + 1;

   // add for vocoder and also store for cleaning up in the end
   AddParameters(f0, f0_length
                 , sp
                 , aperiodicity
                 , this->synth.get());
   vocoderData.push_back(DNNVocoderData{ f0_length, f0, sp, aperiodicity });

   auto& frames = result->GetFrames();
   while (frames.size() <= y_length) {
      if (IsLocked(this->synth.get())) {
         RefreshSynthesizer(this->synth.get());
      }
      Synthesis2(this->synth.get());

      for (int i = 0; i < this->synth->buffer_size; ++i) {
         frames.push_back(this->synth->buffer[i] * 32767.0);
      }
   }
   result->SetSamplingRate(fs);

   for (int i = 0; i < f0_length; i++) {
      delete[] coarse_aperiodicities[i];
   }
   delete[] coarse_aperiodicities;
   delete[] coarse_aperiodicity;
   delete[] frequency_axis;
}

I could probably also set up a test program... or do you have anything like that for the realtime synthesis in C++ or MATLAB? (There seems to be nothing in "examples".)

mmorise commented 6 years ago

First, I'd like to confirm whether the synthesized waveform is correct or not. I simply loaded the features and generated the waveform from them. The following is the result: http://ml.cs.yamanashi.ac.jp/media/output.wav

I think the result sounds strange. Did I generate the appropriate waveform? If not, I think there is a problem in the .csv files. I used the MATLAB version (which is not the real-time processing) simply to check the features. The sound quality does not seem good, but there is no popping noise in the result.

I have provided a test program that includes the real-time synthesis in "test/test.cpp". WaveformSynthesis2() and WaveformSynthesis3() are examples of real-time synthesis. I have not implemented the real-time synthesis for the MATLAB version.

m-toman commented 6 years ago

Oh, yes, this is the voice of a child (who really sounds like that). I will check test.cpp and also see if I can send you something useful, perhaps the code... or it is probably easier if I send you a binary that you can use to generate sentences without having to build the whole thing. Do you prefer working on Linux, Windows, or macOS?

mmorise commented 6 years ago

Sorry for my late reply.

My environment is Windows 10 and VS 2015. If you can provide a makefile to compile the project, I can also use the Linux version. (I will check it in a Cygwin environment.)

On the other hand, I think that speech parameters that reproduce the problem would be enough to check the difference between Synthesis() and Synthesis2().

m-toman commented 6 years ago

Hi,

apart from the e-mail I sent you, I have meanwhile also tried synthesis at 16 kHz and noticed the same effect. I also experimented with the regular, non-streaming synthesis and with pyworld, prepending the end of the previous chunk to the next chunk, but it yielded only minimal improvements. Is there an optimization step that operates on the whole utterance?

Thanks

mmorise commented 6 years ago

Thank you for your question.

I think there are three possible causes: the speech parameters, the official functions in WORLD, or your implementation using WORLD. Since it is difficult to separate these causes, I'd like to identify the problem first.

If you have speech parameters that reproduce this error, please write them out using codec.cpp and give them to me. I'll then be able to identify the cause of the error and debug it if needed.

m-toman commented 6 years ago

Sorry for the long delay. I finally found the issue recently with the help of the test.cpp you provided.

So for reference I post the problem I encountered here:

My code piece was:

while (frames.size() <= y_length) {
   if (IsLocked(this->synth.get())) {
      RefreshSynthesizer(this->synth.get());
   }
   Synthesis2(this->synth.get());

   for (int i = 0; i < this->synth->buffer_size; ++i) {
      frames.push_back(this->synth->buffer[i] * 32767.0);
   }
}

Your demo, on the other hand, only copies the buffer from Synthesis2() when the return value is OK. So I guess I copied a few corrupted buffers between the chunks, resulting in the popping noises.
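For reference, a minimal sketch of the fixed loop (copying the buffer only when Synthesis2() signals success), assuming the same surrounding code as above:

while (frames.size() <= y_length) {
   if (IsLocked(this->synth.get())) {
      RefreshSynthesizer(this->synth.get());
   }
   // Copy the buffer only when Synthesis2() actually produced samples;
   // otherwise stop and wait for the next AddParameters() call.
   if (Synthesis2(this->synth.get()) == 0)
      break;

   for (int i = 0; i < this->synth->buffer_size; ++i) {
      frames.push_back(this->synth->buffer[i] * 32767.0);
   }
}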

It is now basically indistinguishable from the non-streaming output ;)

Thanks for the help

attitudechunfeng commented 5 years ago

@mmorise, a question about realtime synthesis: I'd like to know how to set the initial parameter "int number_of_pointers".

mmorise commented 5 years ago

You can set number_of_pointers via InitializeSynthesizer(). There is an example in test.cpp; please see line 300 (WaveformSynthesis3()).
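For example, in the call that appeared earlier in this thread, the fifth argument is number_of_pointers (roughly, how many added parameter chunks the synthesizer can hold internally at one time):

// fs, frame_period [ms], fft_size, buffer_size [samples], number_of_pointers
InitializeSynthesizer(48000, 5.0, 2048, 240, 3, &synthesizer);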

attitudechunfeng commented 5 years ago

Okay, I've got it, thanks.