kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License

Multi-band MelGAN #143

Closed · alexdemartos closed this issue 4 years ago

alexdemartos commented 4 years ago

Hi,

just found https://arxiv.org/pdf/2005.05106.pdf

It seems to provide significantly better quality than regular MelGAN, and is also stunningly fast (0.03 RTF on CPU). The authors will be publishing the code shortly.

Any chance we will see an implementation in this great repo? =)

kan-bayashi commented 4 years ago

The core idea is already implemented in this repository (multi-resolution STFT loss with MelGAN). I think it will be very easy to adapt.
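
For newcomers skimming this thread, here is a minimal sketch of that multi-resolution STFT loss idea. The FFT/hop/window triples are illustrative defaults and the function names are made up for this sketch; see the repo's loss module for the actual implementation:

```python
import torch
import torch.nn.functional as F


def stft_magnitude(x, fft_size, hop_size, win_size):
    """STFT magnitude of a batch of waveforms shaped (B, T)."""
    window = torch.hann_window(win_size, device=x.device)
    spec = torch.stft(x, fft_size, hop_size, win_size, window,
                      return_complex=True)
    return torch.clamp(spec.abs(), min=1e-7)


def multi_resolution_stft_loss(y_hat, y,
                               resolutions=((1024, 120, 600),
                                            (2048, 240, 1200),
                                            (512, 50, 240))):
    """Average spectral-convergence + log-magnitude loss over resolutions."""
    loss = 0.0
    for fft_size, hop_size, win_size in resolutions:
        mag_hat = stft_magnitude(y_hat, fft_size, hop_size, win_size)
        mag = stft_magnitude(y, fft_size, hop_size, win_size)
        # Spectral convergence: relative Frobenius error of the magnitudes.
        sc = torch.norm(mag - mag_hat, p="fro") / torch.norm(mag, p="fro")
        # Log STFT magnitude loss.
        mag_l1 = F.l1_loss(torch.log(mag), torch.log(mag_hat))
        loss = loss + sc + mag_l1
    return loss / len(resolutions)
```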

kan-bayashi commented 4 years ago

I checked the paper in detail. FB-MelGAN is almost the same as this repository’s melgan.v3 config. For MB-MelGAN, I need to implement the filters that synthesize the audio from the sub-band signals.

kan-bayashi commented 4 years ago

Anyway, I will try to extend this repository, but it depends on my motivation :)

dathudeptrai commented 4 years ago

@kan-bayashi :))) I think you can do it in 1 day :)).

kan-bayashi commented 4 years ago

First, I will check the full-band MelGAN setting (#145).

kan-bayashi commented 4 years ago

https://authors.library.caltech.edu/6848/1/KOIieeetsp92.pdf
https://www.academia.edu/10919484/Near-perfect-reconstruction_pseudo-QMF_banks

From these papers I mostly understand how to implement the PQMF analysis-synthesis filter bank, but I'm not sure how to design the prototype filter.
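
For reference, the cosine modulation those papers describe, up to scaling conventions that differ between sources (here $h(n)$ is the prototype lowpass filter, $M$ the number of sub-bands, and $N$ the filter order):

$$h_k(n) = 2\,h(n)\cos\!\left(\frac{(2k+1)\pi}{2M}\left(n - \frac{N}{2}\right) + (-1)^k\frac{\pi}{4}\right)$$

$$g_k(n) = 2\,h(n)\cos\!\left(\frac{(2k+1)\pi}{2M}\left(n - \frac{N}{2}\right) - (-1)^k\frac{\pi}{4}\right)$$

for $k = 0, \dots, M-1$. The analysis filters $h_k$ and synthesis filters $g_k$ differ only in the sign of the phase term, so everything reduces to designing the single prototype $h(n)$.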

kan-bayashi commented 4 years ago

I think I managed to implement PQMF.

[screenshot: 2020-05-16 4:56 PM]
kan-bayashi commented 4 years ago

The design of the prototype filter is quite heuristic; I should study digital signal processing more deeply...
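
To make the heuristic concrete, here is a minimal PQMF sketch built on a Kaiser-windowed lowpass prototype. The class name and API are mine, and the taps / cutoff / beta values below are hand-tuned guesses for 4 sub-bands, not validated constants:

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.signal import firwin


class PQMF(torch.nn.Module):
    """Pseudo-QMF analysis/synthesis filter bank (sketch)."""

    def __init__(self, subbands=4, taps=62, cutoff=0.15, beta=9.0):
        super().__init__()
        # Prototype lowpass filter; cutoff is relative to the Nyquist rate
        # and should sit near 1 / (2 * subbands). taps, cutoff, and beta are
        # exactly the heuristic knobs being complained about above.
        h_proto = firwin(taps + 1, cutoff, window=("kaiser", beta))
        n = np.arange(taps + 1)
        h_analysis = np.zeros((subbands, taps + 1))
        h_synthesis = np.zeros((subbands, taps + 1))
        for k in range(subbands):
            # Cosine modulation shifts the prototype to each band center.
            arg = (2 * k + 1) * (np.pi / (2 * subbands)) * (n - taps / 2)
            phase = (-1) ** k * np.pi / 4
            h_analysis[k] = 2 * h_proto * np.cos(arg + phase)
            h_synthesis[k] = 2 * h_proto * np.cos(arg - phase)
        self.register_buffer(
            "analysis_filter",
            torch.from_numpy(h_analysis).float().unsqueeze(1))   # (M, 1, K)
        self.register_buffer(
            "synthesis_filter",
            torch.from_numpy(h_synthesis).float().unsqueeze(0))  # (1, M, K)
        self.subbands = subbands
        self.pad = taps // 2

    def analysis(self, x):
        """(B, 1, T) -> (B, M, T // M): filter, then keep every M-th sample."""
        x = F.conv1d(x, self.analysis_filter, padding=self.pad)
        return x[:, :, ::self.subbands]

    def synthesis(self, x):
        """(B, M, T // M) -> (B, 1, T): zero-stuff, then filter and sum."""
        up = torch.zeros(x.size(0), self.subbands,
                         x.size(2) * self.subbands, device=x.device)
        # Multiply by M to compensate for the energy lost to zero-stuffing.
        up[:, :, ::self.subbands] = x * self.subbands
        return F.conv1d(up, self.synthesis_filter, padding=self.pad)
```

With 4 bands, analysis turns a (B, 1, T) waveform into a (B, 4, T/4) tensor, and synthesis inverts it up to the near-perfect-reconstruction error of the prototype.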

kan-bayashi commented 4 years ago

I could not understand why MB-MelGAN's upsampling scales differ from FB-MelGAN's. In my understanding, PQMF analysis does not change the length of the waveform.

dathudeptrai commented 4 years ago

Maybe that is a trick to speed up inference. I just saw a paper upsampling by 50 (with 4 sub-bands), which would mean hop size = 200?

kan-bayashi commented 4 years ago

As I mentioned above, in my understanding, PQMF analysis does not change the length of the waveform, so the sub-band signals have the same length as the original. How do we get from x50 upsampling to a hop size of 200?

dathudeptrai commented 4 years ago

I don't understand either :))) Why does the paper use upsampling scales 2, 5, 5? :)))

dathudeptrai commented 4 years ago

Does that mean the output is 4 channels, which are then reshaped to cover the 200-sample hop?

kan-bayashi commented 4 years ago

But from Figure 1, it seems they convert the 4-channel signal to a 1-channel signal with the PQMF synthesis filter.

kan-bayashi commented 4 years ago

I will check the DurIAN paper again. Maybe my understanding of the PQMF analysis-synthesis part is wrong.

kan-bayashi commented 4 years ago

https://arxiv.org/pdf/1909.01700.pdf

> Downsampling is achieved by taking every Nth sample, and upsampling is done by padding zeros between the original samples.

Oh, I missed this trick. I will check it.
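
For anyone else who missed it, the quoted trick in a few lines of PyTorch (shapes are illustrative):

```python
import torch

M = 4                          # number of sub-bands
x = torch.randn(1, 1, 200)     # (B, C, T) full-rate signal

# Downsampling by M: keep every M-th sample.
x_down = x[:, :, ::M]          # (1, 1, 50)

# Upsampling by M: insert M - 1 zeros between consecutive samples.
x_up = torch.zeros_like(x)     # (1, 1, 200)
x_up[:, :, ::M] = x_down
```

In the filter bank, the analysis filter runs before the decimation and the synthesis filter runs after the zero-insertion.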

kan-bayashi commented 4 years ago

I managed to do it! This is very interesting :)

dathudeptrai commented 4 years ago

So how can we adjust the length from 50 to 200? I have not read the paper in detail yet.

kan-bayashi commented 4 years ago

Yeah, we can reduce the sub-band signal length from N to N/4, so x50 upsampling is enough: the PQMF synthesis step restores the remaining factor of 4, and 50 x 4 = 200 matches the hop size.
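
A quick sanity check of that bookkeeping (all sizes here are illustrative):

```python
import torch

frames, hop, subbands = 10, 200, 4   # hop size 200, 4 sub-bands
mel = torch.randn(2, 80, frames)     # (B, n_mels, T_frames) input

# The generator upsamples only by hop // subbands = 50 (e.g. scales 2 * 5 * 5)
# and emits `subbands` output channels instead of one.
up = hop // subbands                             # 50
sub = torch.randn(2, subbands, frames * up)      # dummy generator output

# PQMF synthesis zero-stuffs by `subbands` and filters, restoring full rate.
assert sub.size(2) * subbands == frames * hop    # 500 * 4 == 2000
```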

kan-bayashi commented 4 years ago

I finished the implementation (#147). Training on LJSpeech is now in progress.

kan-bayashi commented 4 years ago

This multi-band technique can also be applied to PWG (the current implementation already supports it, no modification needed). If I have free GPU slots, I will also try it.

dathudeptrai commented 4 years ago

I am training in TensorFlow. What are your results so far? The model seems to converge very slowly.

kan-bayashi commented 4 years ago

I was struggling to speed up training with a large batch size, so I am still at around 300k iterations; it has not converged yet. How many iterations do you use?

dathudeptrai commented 4 years ago

1M steps with batch size 128, and the generator pretrained for 200k steps as the paper suggests.

kan-bayashi commented 4 years ago

Thanks. Now I am training with batch_size = 64. Let's wait for this weekend :)

kan-bayashi commented 4 years ago

I checked the samples at 350k iterations. I found that there are discontinuities between the sub-bands. I'm not sure whether further training can solve this problem.

[screenshot: 2020-05-19 10:45 AM]
azraelkuan commented 4 years ago

@kan-bayashi At about 1M steps there are no discontinuities, but I see the spectral convergence loss and log STFT magnitude loss increase once the discriminator is introduced. Maybe we should decrease the pretraining steps?

kan-bayashi commented 4 years ago

Oh, that is good news! The STFT loss does increase once the discriminator loss is introduced, but the STFT loss value itself is not directly related to perceptual quality, so we need to check the quality carefully. How was the quality?

azraelkuan commented 4 years ago

@kan-bayashi Not good so far. The original demo (https://anonymous9747.github.io/demo/) works well on 16 kHz audio, but with 24 kHz there is less high-frequency detail. I train on 24 kHz data, so there is a gap between the generated wav and the ground-truth wav.

kan-bayashi commented 4 years ago

I see. I'm wondering how the number of sub-bands or the PQMF prototype filter affects quality. How did you design the PQMF filter?

azraelkuan commented 4 years ago

I checked this open-source PQMF implementation: https://ww2.mathworks.cn/matlabcentral/fileexchange/35053-uniform-filter-bank

kan-bayashi commented 4 years ago

Thank you for the information! I will check it.

kan-bayashi commented 4 years ago

Now I'm wondering whether we should perform PQMF analysis-synthesis on the ground-truth waveform during training.
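
One option, sketched with the hypothetical `PQMF` class and `multi_resolution_stft_loss` from earlier in this thread (this is my reading of the MB-MelGAN paper, not a confirmed design): PQMF-analyze the ground truth to get sub-band targets, and add a full-band loss after synthesis.

```python
import torch

# Assumes PQMF and multi_resolution_stft_loss from the sketches above.
pqmf = PQMF(subbands=4)
y = torch.randn(8, 1, 8000)        # dummy ground-truth waveform
sub_hat = torch.randn(8, 4, 2000)  # dummy generator output (4 sub-bands)

# Sub-band loss: targets come from PQMF *analysis* of the ground truth.
sub_gt = pqmf.analysis(y)          # (8, 4, 2000)
loss_sub = multi_resolution_stft_loss(
    sub_hat.reshape(-1, sub_hat.size(2)),
    sub_gt.reshape(-1, sub_gt.size(2)))

# Full-band loss: compare waveforms after PQMF synthesis.
y_hat = pqmf.synthesis(sub_hat)    # (8, 1, 8000)
loss_full = multi_resolution_stft_loss(y_hat.squeeze(1), y.squeeze(1))
loss = loss_full + loss_sub
```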

kan-bayashi commented 4 years ago
[screenshot: 2020-05-19 4:21 PM]

The discontinuities between sub-bands disappeared at 400k iterations. Interesting :)

dathudeptrai commented 4 years ago

Can you share sample audios, @kan-bayashi?

kan-bayashi commented 4 years ago

OK. I will add samples here: https://drive.google.com/open?id=1ls_YxCccQD-v6ADbG6qXlZ8f30KrrhLT

dathudeptrai commented 4 years ago

Seems like everything works fine; let's see the performance at 1M steps :v. BTW, I see many of your Interspeech 2020 papers :))) You are superman :))

azraelkuan commented 4 years ago

@kan-bayashi Can you share your TensorBoard?

kan-bayashi commented 4 years ago
[screenshot: TensorBoard, 2020-05-19 5:05 PM]

Sorry, I resumed training several times, so the curves are a bit tangled around 200k iterations.

Iamgoofball commented 4 years ago

Is the code in https://github.com/kan-bayashi/ParallelWaveGAN/pull/147 up to date with what's being shown here in this issue report?

kan-bayashi commented 4 years ago

Right. Here I report the training progress with #147.

Iamgoofball commented 4 years ago

Cool, thanks! :D

Iamgoofball commented 4 years ago

Out of curiosity, do you have any write-ups on how to set up custom datasets for this project? I've got about 2 hours of data from a 48 kHz voice I'd like to try training on, but I'm not sure where to begin.

kan-bayashi commented 4 years ago

See https://github.com/kan-bayashi/ParallelWaveGAN/tree/master/egs#how-to-make-the-recipe-for-your-own-dateset

Iamgoofball commented 4 years ago

Rad, thanks. Going to see if I can make some Persona voice models.

dathudeptrai commented 4 years ago

@kan-bayashi Do you think there is a difference between upsample scales [2, 4, 8] and [8, 4, 2]? I just saw a paper use [2, 5, 5] for hop size 200, so maybe [2, 4, 8] would be better?

kan-bayashi commented 4 years ago

> @kan-bayashi Do you think there is a difference between upsample scales [2, 4, 8] and [8, 4, 2]? I just saw a paper use [2, 5, 5] for hop size 200, so maybe [2, 4, 8] would be better?

Yeah, I'm wondering which is better. We need to compare.
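
For what it's worth, both orderings give the same total upsampling factor (2 * 4 * 8 = 8 * 4 * 2 = 64); any difference would come from how early the time axis is expanded, not from the product. A toy check (the layer shapes are illustrative, not the repo's generator):

```python
import torch

def make_stack(scales, channels=64):
    """Stack of transposed convs, each upsampling time by its scale."""
    layers = []
    for s in scales:
        layers.append(torch.nn.ConvTranspose1d(
            channels, channels, kernel_size=2 * s, stride=s, padding=s // 2))
        layers.append(torch.nn.LeakyReLU(0.2))
    return torch.nn.Sequential(*layers)

x = torch.randn(1, 64, 10)
for scales in ([2, 4, 8], [8, 4, 2]):
    print(scales, make_stack(scales)(x).shape)  # both: (1, 64, 640)
```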

Iamgoofball commented 4 years ago

Is there an easy way to set where the model and checkpoints are saved when training a new model? I'm not sure where they go right now, but I'd like to pipe them into my Google Drive, since I'm running the project via Google Colab.

ZDisket commented 4 years ago

How much time did it take you to get the multiband model to 400k steps?

kan-bayashi commented 4 years ago

> Is there an easy way to set where the model and checkpoints are saved when training a new model? I'm not sure where they go right now, but I'd like to pipe them into my Google Drive, since I'm running the project via Google Colab.

@Iamgoofball Please do not raise unrelated questions in this issue. The model will be saved under the egs/<recipe_name>/voc1/exp/ directory. To understand the directory structure, please run egs/yesno/voc1; it finishes within a minute.