The core idea is already implemented in this repository (multi-resolution STFT loss with MelGAN). I think it is very easy to adapt.
I checked the paper in detail. FB-MelGAN is almost the same as this repository's melgan.v3 config. For MB-MelGAN, I need to implement the filters to synthesize the audio from the sub-band signals.
Anyway, I will try to extend this repository but it depends on my motivation :)
@kan-bayashi :))) I think you can do it in 1 day :)).
First, I will check the full-band melgan setting (#145).
https://authors.library.caltech.edu/6848/1/KOIieeetsp92.pdf
https://www.academia.edu/10919484/Near-perfect-reconstruction_pseudo-QMF_banks
From these papers, I almost understood how to implement the PQMF analysis-synthesis filters, but I'm not sure how to design the prototype filter.
Maybe I could implement PQMF.
The design of the filter is quite heuristic; I should understand more about digital signal processing...
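For reference, a minimal numpy/scipy sketch of the cosine-modulated PQMF bank described in those papers; the Kaiser-window prototype and the taps / cutoff / beta values here are illustrative assumptions, not tuned settings:

```python
import numpy as np
from scipy.signal import firwin

def pqmf_filters(subbands=4, taps=62, cutoff=0.15, beta=9.0):
    """Build analysis/synthesis filters for a near-PR pseudo-QMF bank."""
    # Lowpass prototype h(n), designed with a Kaiser window.
    # cutoff is normalized to the Nyquist frequency (scipy convention).
    h = firwin(taps + 1, cutoff, window=("kaiser", beta))
    n = np.arange(taps + 1)
    analysis = np.zeros((subbands, taps + 1))
    synthesis = np.zeros((subbands, taps + 1))
    for k in range(subbands):
        # Cosine modulation shifts the prototype to the k-th subband;
        # the +/- pi/4 phase terms give near-perfect reconstruction.
        arg = (2 * k + 1) * np.pi / (2 * subbands) * (n - taps / 2)
        phase = (-1) ** k * np.pi / 4
        analysis[k] = 2 * h * np.cos(arg + phase)
        synthesis[k] = 2 * h * np.cos(arg - phase)
    return analysis, synthesis
```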
I could not understand why MB-MelGAN's upsampling scales are different from FB-MelGAN's. In my understanding, PQMF analysis does not change the length of the waveform.
Maybe that is a trick to speed up inference. I just saw a paper upsampling to 50 (there are 4 sub-bands); does that mean hop size = 200?
As I mentioned above, in my understanding, PQMF analysis does not change the length of the waveform, so the sub-band signals have the same length as the original waveform. How do we adjust the length from 50 to 200?
I do not understand either :))). Why does the paper use upsampling scales 2, 5, 5? :)))
Does that mean the output is 4 units, which is then reshaped to 200?
But from Figure 1, it seems that they convert the 4-channel signal to a 1-channel signal with the PQMF synthesis filter.
I will check the DurIAN paper again. Maybe my understanding of the PQMF analysis-synthesis part is wrong.
https://arxiv.org/pdf/1909.01700.pdf
Downsampling is achieved by taking every Nth sample, and upsampling is done by padding zeros between the original samples.
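A tiny numpy sketch of that trick (the signal and factor here are just for illustration):

```python
import numpy as np

x = np.arange(12.0)            # a filtered subband signal at full rate
N = 4                          # number of subbands / decimation factor

down = x[::N]                  # keep every Nth sample -> length 12 / 4 = 3
up = np.zeros(len(down) * N)
up[::N] = down                 # stuff N - 1 zeros between samples
# "up" is then passed through the synthesis filter to interpolate.
```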
Oh, I missed this trick. I will check it.
I managed to do it! This is very interesting :)
So how can we adjust the length from 50 to 200? I have not read the paper in detail yet.
Yeah, we can reduce the sub-band signal length from N to N / 4. Then x50 upsampling is enough.
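Put as a quick calculation (numbers taken from the discussion above):

```python
hop_size = 200                    # full-rate samples per mel frame
subbands = 4                      # PQMF decimation factor
upsample = hop_size // subbands   # = 50, e.g. factored as 2 * 5 * 5
```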
I finished the implementation (#147). Now the training on LJSpeech is on-going.
This multi-band technique can also be applied to PWG. (We do not need to modify the current implementation; it is already available.) If I have free GPU slots, I will also try it.
I am training in TensorFlow. What are your results now? It seems the model converges very slowly.
I was struggling to speed up training with a large batch size, so I am still at around 300k iterations; it has not converged yet. How many iterations do you use?
1M steps with batch size 128, generator pretrained for 200k steps as the paper suggested.
Thanks, now I am training with batch_size = 64. Let us wait until this weekend :)
I checked the sample @ 350k. I found that there are disconnections between sub-bands. I'm not sure whether further training can solve this problem.
@kan-bayashi, at around 1M steps there are no disconnections, but I see the spectral convergence loss and log magnitude loss increase when the discriminator is introduced. Maybe decrease the pretraining steps?
Oh, that is good news! The STFT loss will increase when the discriminator loss is introduced, but the STFT loss value itself is not directly related to perceptual quality, so we need to check the quality carefully. How was the quality?
@kan-bayashi, not good for now. It seems the original demo (https://anonymous9747.github.io/demo/) works well with 16 kHz audio, but at 24 kHz there is less high-frequency detail. I used 24 kHz data to train, so there is a gap between the generated wav and the GT wav.
I see. I'm wondering how the number of sub-bands or the PQMF prototype filter affects the quality. How did you design the PQMF filter?
I checked this open-source PQMF: https://ww2.mathworks.cn/matlabcentral/fileexchange/35053-uniform-filter-bank
Thank you for the information! I will check it.
Now I'm wondering whether we should perform analysis-synthesis on the ground-truth waveform during training.
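To make the two options concrete, a rough sketch; `pqmf.analysis` / `pqmf.synthesis` and `stft_loss` are assumed interfaces here, not the final implementation:

```python
def compute_stft_losses(generator, pqmf, stft_loss, mel, y):
    """Compare training against raw vs. analysis-synthesized ground truth.

    generator: mel -> (B, subbands, T // subbands) sub-band signals
    pqmf:      module with .analysis / .synthesis (names assumed)
    stft_loss: multi-resolution STFT loss callable
    """
    y_mb_hat = generator(mel)              # predicted sub-band signals
    y_hat = pqmf.synthesis(y_mb_hat)       # (B, 1, T) full-band estimate

    # Option A: compare with the raw ground-truth waveform.
    loss_raw = stft_loss(y_hat, y)

    # Option B: pass the ground truth through analysis-synthesis first,
    # so both sides share the same PQMF reconstruction error.
    y_as = pqmf.synthesis(pqmf.analysis(y))
    loss_as = stft_loss(y_hat, y_as)
    return loss_raw, loss_as
```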
Disconnections disappeared @ 400k iters. Interesting :)
Can you share sample audio, @kan-bayashi?
OK. I will add items here. https://drive.google.com/open?id=1ls_YxCccQD-v6ADbG6qXlZ8f30KrrhLT
Seems everything works fine; let's see the performance at 1M steps :v. BTW, I saw many of your Interspeech 2020 papers :))), you are superman :))
@kan-bayashi Can you share your TensorBoard?
Sorry, I resumed training several times, so the curves are a bit conflicted around 200k iters.
Is the code in https://github.com/kan-bayashi/ParallelWaveGAN/pull/147 up to date with what's being shown here in this issue report?
Right. Here I report the training progress with #147.
Cool, thanks! :D
Out of curiosity, do you have any write-ups on how to set up custom datasets for this project? I've got about 2 hours of data of a 48 kHz voice I'd like to try training on, but I'm not sure where to begin.
Rad, thanks. Going to see if I can make some Persona voice models.
@kan-bayashi Do you think there is a difference between upsample scales [2, 4, 8] and [8, 4, 2]? I just saw a paper use [2, 5, 5] for hop size 200, so maybe [2, 4, 8] will be better?
Yeah, I'm wondering which is better. We need to compare.
Is there an easy way to set where the model and checkpoints are saved when training a new model? I'm not sure where they go right now, but I'd like to pipe them into my Google Drive, as I'm running the project via Google Colab.
How much time did it take you to get the multiband model to 400k steps?
@Iamgoofball Please do not raise your question in this issue.
The model will be saved under the egs/<recipe_name>/voc1/exp/ directory. To understand the directory structure, please run egs/yesno/voc1. It will finish within a minute.
Hi, just found https://arxiv.org/pdf/2005.05106.pdf
It seems to provide significantly better quality than regular MelGAN, and is also stunningly fast (0.03 RTF on CPU). The authors will be publishing the code shortly.
Any chance we will see an implementation in this great repo? =)