roebel / MBExWN_Vocoder

The Multi-band Excited WaveNet
GNU Affero General Public License v3.0

No training script? #1

Open andreykramer opened 2 years ago

andreykramer commented 2 years ago

Hi, first of all thanks for sharing your code and pretrained models, they sound great.

I'd like to ask whether it would be possible for you to upload the training script you used, so that I can train my own model; I don't see it anywhere in the repository.

Thanks in advance.

roebel commented 2 years ago

Thanks for your interest.

Unfortunately, I won't be able to share the training scripts. My training implementation relies on an input pipeline, F0 estimator, and database management that cannot be shared publicly, and reimplementing them with publicly available means would be quite some work without bringing any advantage for the applications I need to develop for my work.

That said, I perfectly understand the benefit of an alternate implementation for training the model that uses only public tools and frameworks and one or two publicly available datasets, and I would consider contributing to such a training setup if I did not have to do it alone.

andreykramer commented 2 years ago

I understand. I will fork the repo and try to implement the training script with public resources myself and, if you don't mind, hit you with questions if I get stuck. My idea would be to replicate at least the MW-SP-FD training with a public dataset like VCTK. For F0, do you think it would be too detrimental to performance to use just the confidence value of FCN and skip the harmonicity calculation? With the proprietary code issue set aside, I'd like to ask you some questions:

  1. Is adaptive normalization used in MW-SP-FD? Is it applied in the normalize_inputs_by_rms function? I see there's also a norm_mell function that is not entered during inference with this model, and I'm not sure which of the two is the one referenced as adaptive normalization in the paper.
  2. If I understand correctly, the training of the model was as follows:
    1. 100k batches where the f0 prediction model is pre-trained.
    2. 200k batches of the generator without the discriminator loss.
    3. 800k batches of the generator with the discriminator loss.
  3. Was the f0 prediction trained jointly during the whole training or just in step 1?
  4. I cannot find the code for the discriminator/adversarial loss in the repo; would it be the same as this unofficial implementation?

Thanks a lot.

roebel commented 2 years ago

> I understand. I will fork the repo and try to implement the training script with public resources myself and, if you don't mind, hit you with questions if I get stuck ... VCTK

Sure, that was the idea. I could also take part in the implementation work so that this repo gets a training script as well. I was planning at some point to propose adding the vocoder to the TensorflowTTS project on GitHub.

> For F0, do you think it would be too detrimental to performance to use just the confidence value of FCN and skip the harmonicity calculation?

This is a good question. I did not try this, but looking at the confidence values of the FCN estimator did not create a lot of confidence in me. That's why I invested some time in enhancing the confidence estimation by combining it with a harmonicity estimation from one of our other software tools. Unfortunately, the combination requires that other non-Python tool, which is not publicly available.

Before we had FCN we used swipe, and for me this would be the first thing to try. There are a few implementations, but some don't output the strength, so you don't get a voiced/unvoiced decision either, and I have never tried the others. I have a Python-only implementation I could integrate here easily. This would be a part I can contribute.
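
As a concrete example of a publicly available SWIPE implementation (an illustration only, not the script later added to the repo), pysptk exposes a swipe function; note that it returns 0 Hz for frames it considers unvoiced rather than a separate strength track:

    import numpy as np
    import pysptk
    import soundfile as sf

    # Example only: estimate F0 with the SWIPE implementation from pysptk.
    # Frames judged unvoiced come back as 0 Hz; no separate strength output.
    audio, sr = sf.read("example.wav")   # assumes a mono wav file
    hop = int(0.005 * sr)                # 5 ms analysis hop, as an example
    f0 = pysptk.swipe(audio.astype(np.float64), fs=sr, hopsize=hop,
                      min=60.0, max=500.0, threshold=0.3, otype="f0")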

> Is adaptive normalization used in MW-SP-FD? Is it applied in the normalize_inputs_by_rms function? I see there's also a norm_mell ...

Yes! For me this was important. The vocoder works for arbitrary signal scaling factors. You may also skip this and train with random scaling factors instead; the degradation is quite negligible, but as the normalization function is provided I don't see any argument not to use it. self.norm_mel_components.normalize_inputs_by_rms is the right function. I started experimenting with the norm_mell function in NumPy but later made a TensorFlow-specific implementation directly in the vocoder.
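
For illustration, a minimal sketch of RMS-based input normalization, assuming a linear-amplitude mel batch of shape (batch, frames, mel_bins); the repo's normalize_inputs_by_rms is the reference and may differ in the details:

    import tensorflow as tf

    def rms_normalize_mel_sketch(mel, eps=1e-8):
        """Illustrative sketch only (not the repo's normalize_inputs_by_rms):
        divide each example's mel spectrogram by its RMS so that the
        conditioning becomes independent of the input signal's overall gain.
        mel: tensor of shape (batch, frames, mel_bins), linear amplitude."""
        rms = tf.sqrt(tf.reduce_mean(tf.square(mel), axis=[1, 2], keepdims=True) + eps)
        return mel / rms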

> If I understand correctly, the training of the model was as follows:

You have got that perfectly right!

> Was the f0 prediction trained jointly during the whole training or just in step 1?

All the time.

> I cannot find the code for the discriminator/adversarial loss in the repo; would it be the same as this unofficial implementation?

I don't know this repo and cannot tell you whether it works. Anyway, that one is PyTorch and therefore more complicated to copy. I would suggest you use the implementation from the TensorflowTTS repo. I have used that one as my reference implementation.

Another question from my end. Where will you train the model? Do you have access to a GPU?

Best

andreykramer commented 2 years ago

Thanks a lot, now it's clear to me. I will begin working on it soon and will appreciate any contribution you may make.

> Another question from my end. Where will you train the model? Do you have access to a GPU?

Yes! I will train on an RTX Titan.

andreykramer commented 2 years ago

Another question: do you use MELInverter's scale_mel for anything during training/data preprocessing? I see it is used in inference, but its purpose is not very clear to me.

Is it there in case you generate mels with different parameters than the ones the model requires? Am I right that if you use the model's config for generating the mels, you no longer need scale_mel?

roebel commented 2 years ago

> Thanks a lot, now it's clear to me. I will begin working on it soon and will appreciate any contribution you may make.

I will create a branch for this and integrate the swipe script into a utils subdirectory.

> Is it there in case you generate mels with different parameters than the ones the model requires? Am I right that if you use the model's config for generating the mels, you no longer need scale_mel?

Yes! See the script scripts/generate_mels.py: it shows that mel analyses are stored as pickle files containing quite a few parameters that configure the mel analysis. scale_mel checks that all of these fit the model's expectations. For some discrepancies, scale_mel can adapt the mel spectrogram so that it fits the model's expectations; for others, it will quit with an error. If you are sure that your mel was properly configured, you don't need scale_mel. On the other hand, given the many parameters involved in creating mel spectrograms, it is pretty easy to get it wrong. The interface in resynth_mel.py forces you to be explicit about the parameters you used and will therefore avoid problems.
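
As an illustration of the kind of check scale_mel performs, here is a hedged sketch with hypothetical pickle keys and values; the actual layout is whatever scripts/generate_mels.py writes, so treat every name below as an assumption:

    import pickle

    # Hypothetical key names; the real layout is defined by scripts/generate_mels.py
    # and verified/adapted by MELInverter.scale_mel.
    with open("utterance_mell.pickle", "rb") as fp:
        data = pickle.load(fp)

    expected = {"sr": 24000, "hop_size": 300, "nb_mel_bands": 80}  # assumed model config
    for key, want in expected.items():
        have = data.get(key)
        if have != want:
            raise ValueError(f"mel analysis parameter {key}={have} does not match "
                             f"the model's expectation {want}")
    mel = data["mell"]  # assumed field name for the mel matrix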

andreykramer commented 2 years ago

> I will create a branch for this and integrate the swipe script into a utils subdirectory.

I will pull it into my repo when you have it.

> The interface in resynth_mel.py forces you to be explicit about the parameters you used and will therefore avoid problems.

I see, thanks!

I started working on the dataset class. Since you said you eventually want it integrated into TensorflowTTS, I'm trying to strike a balance between their interfaces and what you already have in your repo, so it's easier for you to integrate later. The logic I implemented is based on the hifigan loader and is the following:

  1. Find pairs of audio and *mell pickle files.
  2. Load the audio and the mel.
  3. Cut them, beginning at a random index, if the audio is longer than the desired example length, or pad otherwise, taking into account that each mel frame corresponds to hop_size audio samples (a sketch of this step follows after the list).
  4. Batch and return.
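
As announced in point 3, a minimal NumPy sketch of the crop/pad step, assuming a (frames, mel_bins) mel array and a 1-D audio array; names and details are mine, not the actual PickleMelDataset code:

    import numpy as np

    def random_crop_or_pad(audio, mel, hop_size, mel_frames_per_example, rng=np.random):
        """Illustrative sketch: return an aligned (audio, mel) pair of fixed
        length, cropping at a random mel frame or zero-padding."""
        seg_samples = mel_frames_per_example * hop_size
        n_frames = mel.shape[0]
        if n_frames > mel_frames_per_example:
            start = rng.randint(0, n_frames - mel_frames_per_example + 1)
            mel = mel[start:start + mel_frames_per_example]
            audio = audio[start * hop_size:start * hop_size + seg_samples]
        else:
            mel = np.pad(mel, ((0, mel_frames_per_example - n_frames), (0, 0)))
        # make sure the audio segment matches the mel segment exactly
        audio = audio[:seg_samples]
        if audio.shape[0] < seg_samples:
            audio = np.pad(audio, (0, seg_samples - audio.shape[0]))
        return audio, mel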

You can find the code here, and I would be thankful if you took a look at it. Suggestions are welcome: https://github.com/andreykramer/MBExWN_Vocoder/blob/feature/add_train_script/MBExWN_NVoc/vocoder/dataset/pickle_mel_dataset.py

For the moment it seems a lot slower than it should be; I will experiment with saving the mels in *.npy files to see whether that gains performance.

When you upload the f0 code I will try to integrate it into this dataset class too. With that, we would have everything you want in a batch, right? Mel, audio and f0.

roebel commented 2 years ago

You will now find a new branch, training_implementation, which contains support for F0 analysis with ./bin/calc_f0_swipe.py. It can output either pickle files or txt files. I had rather good results with a harmonicity threshold of 0.3. For the training database, you need to set unvoiced frames with harmonicity below the threshold to 0. This is done with the -z command line arg.
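
For illustration, the zeroing step described above as a small NumPy sketch (the function and argument names are assumptions, not the script's interface):

    import numpy as np

    def zero_unvoiced(f0_hz, harmonicity, threshold=0.3):
        """Sketch of the zeroing described above: frames whose harmonicity
        falls below the threshold are treated as unvoiced (F0 = 0)."""
        f0 = np.asarray(f0_hz, dtype=np.float64).copy()
        f0[np.asarray(harmonicity) < threshold] = 0.0
        return f0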

> When you upload the f0 code I will try to integrate it into this dataset class too. With that, we would have everything you want in a batch, right? Mel, audio and f0.

Yes, and see the branch.

> For the moment it seems a lot slower than it should be; I will experiment with saving the mels in *.npy files to see whether that gains performance.

For me, pickle is faster than npy.

The problem I see with the proposed approach is that you will either have to throw away large amounts of data, or you will need rather long segments in the batches (which implies that many examples in the batch contain many zeros). So I think it would be best not to cut the files: store full-length data in the pickle files, or else do the cutting randomly during batch generation.

I could imagine either of these approaches. These are just my ideas, so please do whatever you deem best.

andreykramer commented 2 years ago

Hi, for the moment I will integrate loading the f0 inside the PickleMelDataset that I created, and focus on the training once I have all the input data, because I want to finish the training asap and not bother you with questions for an extended period of time.

I can focus on optimizing the data loading pipeline afterwards, as it will be more straightforward once I have some experience with it; thanks for the suggestions. I have already worked with tfrecords; they would solve the performance issues, but they are more of a hassle, since you need a fixed length for all the features. What I did before was storing the audio and features already windowed and cut, so each time you load an element it is an example corresponding to n seconds of audio. I thought that the approach I implemented in PickleMelDataset would work well enough, as it does in hifigan, even though you do discard a lot of information from each audio file. I will try to modify it to work fast enough without having to create tfrecords.

andreykramer commented 2 years ago

Do you think it's OK to generate the f0 with -s 0.0125 instead of the default 0.005, so the frame rate is the same as the SPEECH model's? (hop_size/sr = 300/24000 = 0.0125)

roebel commented 2 years ago

No! Look at the schematic figure in the README. The F0 is evaluated at the output of the F0 net, and to avoid any interpolation in the model, the F0 target signal is just a subsampled version of the F0 from the batches, which is supposed to be sampled at the model sample rate (24kHz). There are the following arguments for this:

* This is the most flexible setup, where you can freely change the internal model structure without needing to recreate your pickle files.
* The F0 interpolation is a little bit slow; doing it during training would lose a lot of time. Slicing is the fastest way to get the F0 to the proper sample rate.

Which combination of swipe analysis parameters and linear interpolation you choose is left to your intuition. I used the F0 SR as in the FCN_F0 paper (which from memory is 1000Hz). I never tried anything else, so I cannot tell you what effect you will get with a swipe analysis at 80Hz.

andreykramer commented 2 years ago

> No! Look at the schematic figure in the README. The F0 is evaluated at the output of the F0 net, and to avoid any interpolation in the model, the F0 target signal is just a subsampled version of the F0 from the batches, which is supposed to be sampled at the model sample rate (24kHz) ...

So did I understand correctly that the steps are:

* Load pickle, and drop two of each 3 elements (do you mean this by slicing? x[::3]); this way we will have 1 mel frame, 100 f0 frames and 300 audio frames, like in the diagram.

andreykramer commented 2 years ago

I am now implementing the backbone of the training script. So far I have integrated the TFMelGANMultiScaleDiscriminator from TensorflowTTS, included the discriminator configuration from that repo inside the config.yaml of the speech model, and I calculate the adversarial loss as they do:

        # Generator-side LSGAN loss, as in the TensorFlowTTS MelGAN trainer:
        # push the discriminator outputs for generated audio towards 1.
        adv_loss = 0.0
        for i in range(len(disc_outs)):
            adv_loss += calculate_3d_loss(
                tf.ones_like(disc_outs[i][-1]), disc_outs[i][-1], loss_fn=self.mse_loss
            )
        # average over the discriminator scales
        adv_loss /= i + 1
        return adv_loss

Now I'm working on the LR loss; do I understand correctly that it is spect_loss_n from PaNWaveNet's total_loss?

I see in training_utils.py there's a base scheduler class, which I guess is for setting the correct weight for each loss during the training schedule. If that's right, do you have the scheduler class for the training in the paper (100k L_f0, 200k L_f0 + L_R, 800k L_f0 + L_R + L_D)?

andreykramer commented 2 years ago

And can the F0 loss of each iteration be retrieved from PaNWaveNet's block.F0_loss, or do I have to do something else to compute it? I'm now solving a small problem I'm having with the adversarial loss, and will upload snippets of how I obtain each of the losses so you can review them.

roebel commented 2 years ago

> Load pickle, and drop two of each 3 elements (do you mean this by slicing? x[::3]); this way we will have 1 mel frame, 100 f0 frames and 300 audio frames, like in the diagram.

You don't need to slice yourself. Slicing is done in the code for you. You need to keep sound samples and F0 samples with the same sample rate.

roebel commented 2 years ago

> And can the F0 loss of each iteration be retrieved from PaNWaveNet's block.F0_loss, or do I have to do something else to compute it?

Yes, it is directly the block.F0_loss.

roebel commented 2 years ago

> Now I'm working on the LR loss; do I understand correctly that it is spect_loss_n from PaNWaveNet's total_loss?

It's a weighted sum of the spect_loss_n and the NPow loss.

> If that's right, do you have the scheduler class for the training in the paper (100k L_f0, 200k L_f0 + L_R, 800k L_f0 + L_R + L_D)?

You find the PieceWise Constant Scheduler in the dedicated training branch.
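
For orientation, a minimal sketch of a piecewise-constant weight schedule consistent with the 100k/200k/800k phases discussed above; this is not the repo's scheduler class, and the weight values are placeholders:

    def loss_weights_for_step(step):
        """Sketch only: 100k batches of F0 pre-training, then 200k batches
        with the reconstruction loss added, then the adversarial loss."""
        w_f0, w_lr, w_adv = 1.0, 0.0, 0.0
        if step >= 100_000:
            w_lr = 1.0
        if step >= 300_000:
            w_adv = 1.0
        return w_f0, w_lr, w_adv

    # e.g. total = w_f0 * f0_loss + w_lr * lr_loss + w_adv * adv_loss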

andreykramer commented 2 years ago

> You don't need to slice yourself. Slicing is done in the code for you. You need to keep sound samples and F0 samples with the same sample rate.

Oh nice, one problem less to worry about.

> It's a weighted sum of the spect_loss_n and the NPow loss.

In the SPEECH model the NPow loss appears as None for me, the same as the mel loss. Also, only the spectral loss seems to depend on the scheduler. The total_loss calculated in WaveGenerator is a weighted sum of spect_loss_n, mel_loss_n and NPOW_loss_n, and then the block losses are added to it in the total_loss function of PaNWaveNet. Seeing that NPOW_loss_n is None in the SPEECH model and doesn't have a weight scheduled by the scheduler, wouldn't it be sensible to leave it out of the training script for now? Otherwise, I think it should be given its own schedule.

> You find the PieceWise Constant Scheduler in the dedicated training branch.

Thanks!

andreykramer commented 2 years ago

So far this is how I calculate the losses:

    def adv_loss(self, disc_outs):
        adv_loss = 0.0
        for i in range(len(disc_outs)):
            adv_loss += calculate_3d_loss(
                tf.ones_like(disc_outs[i][-1]), disc_outs[i][-1], loss_fn=self.mse_loss
            )
        adv_loss /= i + 1
        adv_loss = tf.reduce_mean(adv_loss)
        return adv_loss

    def lr_loss(self, ins, outs, step=0):
        total_loss, spect_loss, mel_loss, NPOW_loss, *block_losses = self.generator.total_loss(
            outs, ins, step)
        return spect_loss

    def f0_loss(self):
        return self.generator.block.F0_loss

roebel commented 2 years ago

You can leave out the NPOW_loss; it does not make a fundamental difference. For the adversarial loss I won't have the time to look into your calculate_3d_loss function. What is very strange is that you seem to systematically have ones as targets. Depending on how you use the function this may or may not be correct.

andreykramer commented 2 years ago

> What is very strange is that you seem to systematically have ones as targets. Depending on how you use the function this may or may not be correct.

I copied it from melgan's training in TensorflowTTS, but will give it a thought before training. Thanks! https://github.com/TensorSpeech/TensorFlowTTS/blob/136877136355c82d7ba474ceb7a8f133bd84767e/examples/melgan/train_melgan.py#L122

andreykramer commented 2 years ago

Yeah, you're right, there's a lot more to the adversarial loss than this; I'm fixing it now in the trainer I'm implementing based on MelganTrainer.
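
For reference, a sketch of the discriminator-side LSGAN loss that complements the generator adv_loss above, in the spirit of the MelGAN trainer in TensorFlowTTS (the exact reductions used there may differ):

    import tensorflow as tf

    def discriminator_loss(disc_real_outs, disc_fake_outs,
                           mse_loss=tf.keras.losses.MeanSquaredError()):
        """Sketch only: real discriminator scores are pushed towards 1,
        scores for generated audio towards 0, averaged over the scales."""
        real_loss, fake_loss = 0.0, 0.0
        for real, fake in zip(disc_real_outs, disc_fake_outs):
            real_loss += tf.reduce_mean(mse_loss(tf.ones_like(real[-1]), real[-1]))
            fake_loss += tf.reduce_mean(mse_loss(tf.zeros_like(fake[-1]), fake[-1]))
        return (real_loss + fake_loss) / len(disc_real_outs)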

andreykramer commented 2 years ago

I'm not going to be able to work on this for some time, as I'm going on vacation, but I do intend to finish it. I pushed a messy WIP commit so I don't lose my progress; don't take it as anything final.

roebel commented 2 years ago

Fine, enjoy your holiday.