leimao / Voice-Converter-CycleGAN

Voice Converter Using CycleGAN and Non-Parallel Data
https://leimao.github.io/project/Voice-Converter-CycleGAN/

About Loss Curve and timbre of converted voices #8

Open · MorganCZY opened this issue 5 years ago

MorganCZY commented 5 years ago

Hi, after listening carefully to the converted voices, I found that their timbre is not very close to the target speaker's. I then reviewed all the loss definitions and their training curves. As you showed in this repo's README, the loss of D is close to 0 and the loss of G is close to 1, as in the curves below.

[figure: discriminator (D) and generator (G) loss curves during training]

However, if the generator were well trained and could generate samples as vivid as the real ones, the loss of G_A2B or G_B2A should be close to 0, given its definition `self.generator_loss_A2B = l2_loss(y = tf.ones_like(self.discrimination_B_fake), y_hat = self.discrimination_B_fake)`. Meanwhile, the D loss should not be very close to 0, because the real and the fake (generated) samples would be very similar.
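For reference, here is a minimal TF1-style sketch of the LSGAN objectives being discussed. The generator loss follows the line quoted above; the `l2_loss` helper and the averaged discriminator loss are my reconstruction of the usual CycleGAN form, not verbatim from `model.py`:

```python
import tensorflow as tf

def l2_loss(y, y_hat):
    # Mean squared error, matching the LSGAN-style objective in this repo.
    return tf.reduce_mean(tf.squared_difference(y, y_hat))

# Stand-ins for the discriminator's outputs on real and generated B-domain
# samples; in the actual model these come from the discriminator network.
discrimination_B_real = tf.placeholder(tf.float32, shape=[None, 1])
discrimination_B_fake = tf.placeholder(tf.float32, shape=[None, 1])

# Generator adversarial loss (A -> B): minimized when D_B scores fakes near 1.
generator_loss_A2B = l2_loss(y=tf.ones_like(discrimination_B_fake),
                             y_hat=discrimination_B_fake)

# Discriminator loss for domain B: real samples pushed toward 1, fakes toward 0.
discriminator_loss_B_real = l2_loss(y=tf.ones_like(discrimination_B_real),
                                    y_hat=discrimination_B_real)
discriminator_loss_B_fake = l2_loss(y=tf.zeros_like(discrimination_B_fake),
                                    y_hat=discrimination_B_fake)
discriminator_loss_B = (discriminator_loss_B_real + discriminator_loss_B_fake) / 2
```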

leimao commented 5 years ago

Ideally, if the generated samples are realistic enough, the classification accuracy of even a perfect discriminator should be around 0.5: the samples are so realistic that it can only guess whether each one is real. In practice, however, this equilibrium is often very hard to reach during training.
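As a quick sanity check on what that equilibrium means under the `l2_loss` objectives above (assuming the averaged discriminator loss): an undecided discriminator outputs 0.5 for every sample, which puts both losses at 0.25. The 0.5 figure refers to the discriminator's accuracy, not its loss value:

```python
# LSGAN equilibrium: a discriminator that cannot tell real from fake
# minimizes its loss by outputting 0.5 for every sample.
d_out = 0.5

g_loss = (1.0 - d_out) ** 2                             # (1 - 0.5)^2 = 0.25
d_loss = ((1.0 - d_out) ** 2 + (0.0 - d_out) ** 2) / 2  # (0.25 + 0.25) / 2 = 0.25

print(g_loss, d_loss)  # 0.25 0.25 -- neither D ~ 0 nor G ~ 1
```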

MorganCZY commented 5 years ago

Yes, agreed that the ideal D loss is hardly achievable during training in practice. It should not, however, be so close to 0; a value around 0.5 would be acceptable, even with high variance. Besides, I don't understand why the G loss approaches 1 given its definition: since the G loss is the mean of (1 - D(fake))^2, a value near 1 means the discriminator is scoring the generated samples near 0, i.e. confidently rejecting them. If the generated samples were very similar to the real ones, the G loss should instead decline to a small value around 0.
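Reading the observed curves backwards through the loss definitions makes this concrete. With hypothetical discriminator outputs on fakes near 0:

```python
import numpy as np

# Hypothetical D outputs on generated samples (D confidently rejects them).
d_fake = np.array([0.02, 0.05, 0.01])

g_loss = np.mean((1.0 - d_fake) ** 2)       # ~0.95, close to the observed G ~ 1
d_loss_fake = np.mean((0.0 - d_fake) ** 2)  # ~0.001, close to the observed D ~ 0
print(g_loss, d_loss_fake)
```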

leimao commented 5 years ago

That is because the discriminator became too strong while the generator was not strong enough. We would like the generator loss to decrease throughout training, but sometimes that does not happen. You can read the figure above as the following scenario: you are a very good painter, and you have just reproduced Van Gogh's "Sunflowers". You claim it is a genuine Van Gogh and try to sell it on the market for a good price. Unfortunately, an extremely experienced appraiser identifies your work as a fake. That does not mean your painting is bad or that it does not look like Van Gogh's work; it is that the appraiser is too experienced. In my experience, the discriminator/generator losses cannot fully serve as indicators of whether the model is good. But of course, you can definitely try to tune the hyper-parameters and training protocol to bring the losses closer to the game-theoretic equilibrium.
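One common rebalancing trick (not something this repo implements; just a sketch) is one-sided label smoothing, which keeps the discriminator from becoming perfectly confident by training its "real" target toward 0.9 instead of 1.0. Reducing the discriminator's learning rate, or updating the generator more often per discriminator step, are similar levers:

```python
import tensorflow as tf

def l2_loss(y, y_hat):
    return tf.reduce_mean(tf.squared_difference(y, y_hat))

# Discriminator output on real B-domain samples (placeholder stand-in).
discrimination_B_real = tf.placeholder(tf.float32, shape=[None, 1])

# One-sided label smoothing (Salimans et al., 2016): aim D's "real" target
# at 0.9 rather than 1.0 so the discriminator never saturates at full
# confidence, leaving the generator a usable gradient.
discriminator_loss_B_real = l2_loss(
    y=0.9 * tf.ones_like(discrimination_B_real),
    y_hat=discrimination_B_real)
```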

MorganCZY commented 5 years ago

Thank you for the vivid metaphor; it deepened my understanding of this whole work. However, the point I'm struggling with is the timbre issue. I have experimented with several mainstream VC methods and found that the timbre similarity between the converted voices and the target speaker is far from satisfactory. Have you looked into this issue? Do you have any advice for improving the timbre of the converted voices?