Closed choiHkk closed 9 months ago
I think this method is working. After applying audio preprocessing, modifying the residual coupling layer, and adjusting the "discriminator" as I mentioned before, it seems that meaningful results are coming out of the training.
I will continue the training and if I find a useful checkpoint, I will share it with you.
Very nice insight; I should have been more careful earlier. I think one more thing to fix here: when using sdp, we need to send the noise input of sdp to the discriminator as well. Can you send a PR of your discriminator and I will merge it? Thanks a lot!
@p0p4k Of course. But I will make sure to modify it to be compatible with the existing functionality, since there might be conflicts due to the various changes. After that, I will send a pull request.
According to the author's paper, no separate noise is fed to the discriminator directly.
Did you mean that you want to experiment with a different noise setup from the one in the paper?
Ah, my bad. I was thinking about something else. Ignore my previous comment.
Also, regarding breaking code changes, just make it dur_disc_2.
@p0p4k it's ok kkk.
Could I add it to the config and make the necessary adjustments so that it is reflected in the training process? I will check for conflicts against the most recent branch and send a pull request as soon as possible.
Yes, do what you think is best. Thank you for your efforts.
@p0p4k I just sent a pull request. I have verified that both training and inference proceed correctly. One concern: I did not include any changes to the requirements. If you need them, I will post the changes here.
Thank you for your hard work. I have a question that came up while training with your code.
During training of the duration predictor, I noticed that "loss_dur" fluctuates significantly compared to previous work. Upon investigation, I found that "grad_norm_dur_disc" spikes very high. In my opinion, this might be because the adversarial loss is computed from a single discriminator output that is large relative to the weights, especially given the few convolution layers in the discriminator.
As far as I know, in HiFi-GAN the discriminator is composed of several sub-discriminators. That is why there is a for loop inside the "discriminator_loss" and "generator_loss" functions, to accumulate the loss over each sub-discriminator.
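For reference, this is a minimal sketch of the HiFi-GAN/VITS-style least-squares loss functions being discussed; the for loop iterates over the list of sub-discriminator outputs, so a single discriminator would contribute a one-element list:

```python
import torch

# Sketch of HiFi-GAN/VITS-style least-squares GAN losses.
# Each argument is a list with one entry per sub-discriminator.
def discriminator_loss(disc_real_outputs, disc_generated_outputs):
    loss = 0.0
    r_losses, g_losses = [], []
    for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
        r_loss = torch.mean((1 - dr) ** 2)  # push real scores toward 1
        g_loss = torch.mean(dg ** 2)        # push fake scores toward 0
        loss += r_loss + g_loss             # summed without scaling
        r_losses.append(r_loss.item())
        g_losses.append(g_loss.item())
    return loss, r_losses, g_losses

def generator_loss(disc_outputs):
    loss = 0.0
    gen_losses = []
    for dg in disc_outputs:
        l = torch.mean((1 - dg) ** 2)  # generator wants fakes scored as real
        gen_losses.append(l)
        loss += l
    return loss, gen_losses
```

Note that the sum over sub-discriminators is unscaled, which is the point raised below about a single small discriminator.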
Since the "DurationDiscriminator" you implemented does not consist of sub-discriminator layers, when the loss is calculated in the "discriminator_loss" and "generator_loss" functions, it is computed for a single output and summed without any scaling.
In my opinion, this might make training of the "DurationDiscriminator", which has very few parameters, unstable. I'm curious whether computing it for a single output without scaling was intentional. If not, I'm also wondering if it would be acceptable to return the outputs in list form via append() in the discriminator's forward pass. Currently, I'm training the model with the relu non-linearity and layernorm that you wrote but commented out, plus the list form. If I get good results, I will share them in this issue.
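The "list form" idea could be sketched like this; the class name, channel sizes, and layers here are hypothetical, not the actual implementation. The single-head duration discriminator appends its real/fake scores to lists, so the shared loss functions, which loop over sub-discriminators, handle it as a one-element case:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a single-head duration discriminator that returns
# its scores wrapped in lists, matching the sub-discriminator loops in
# discriminator_loss / generator_loss. Layer sizes are illustrative.
class DurationDiscriminatorSketch(nn.Module):
    def __init__(self, in_channels=192, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels + 1, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, 3, padding=1),
        )

    def forward(self, x, dur_real, dur_fake):
        # x: text-side features (B, C, T); durations: (B, 1, T)
        outs_real, outs_fake = [], []
        # append() keeps list form so downstream loss loops just work
        outs_real.append(self.net(torch.cat([x, dur_real], dim=1)))
        outs_fake.append(self.net(torch.cat([x, dur_fake], dim=1)))
        return outs_real, outs_fake
```

With this shape, the same loss functions can be reused unchanged for both the waveform discriminators and the duration discriminator.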