ncsoft / avocodo

Official implementation of "Avocodo: Generative Adversarial Network for Artifact-Free Vocoder" (AAAI2023)

The learning rate #4

Open Selimonder opened 1 year ago

Selimonder commented 1 year ago

Hello,

Thank you for presenting awesome ideas with your work and addressing fundamental issues in previous works.

In the Training Setup section of your paper, the learning rate is mentioned as 2e-3, whereas your implementation uses 2e-4.

2e-4 sounds more reasonable (given the HiFi-GAN baseline). However, I couldn't achieve balanced training with this value; it always ended up with a slight metallic artifact.

I am 1M steps in with 2e-3 and it looks better, but I still have doubts about it.

Can you explain the discrepancy?

Thank you

WhiteFu commented 1 year ago

Hello, I have the same question.

Judging by the results of your experiment, how does the sound quality of Avocodo compare with HiFi-GAN? Is there a recommended hyperparameter setting, such as a learning rate of 2e-3 or 2e-4?

I am looking forward to your reply

Selimonder commented 1 year ago

Hello,

The learning rate of 2e-4 worked better in the end (the official implementation also uses this value). Perhaps there is a typo in the paper.

The default setup did not work well for my training setup. The following additions helped:

  • Randomly skip discriminator optimization steps (10% chance)
  • Lower learning rate for the discriminator

The sound quality of Avocodo is overall good, and there's a reduction in artifacts.
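For readers who want to try the two workarounds mentioned in this thread (a lower discriminator learning rate and randomly skipping some discriminator updates), a minimal sketch might look like this. This is not from the Avocodo codebase; the constant values other than the 2e-4 generator rate and the 10% skip chance, and all helper names, are illustrative assumptions:

```python
import random

# Illustrative sketch, not the official Avocodo code: a separate, lower
# learning rate for the discriminator, plus a 10% chance of skipping
# the discriminator's optimization step on any given iteration.
GEN_LR = 2e-4     # generator learning rate (value used by the official code)
DISC_LR = 1e-4    # lower discriminator learning rate (assumed value)
SKIP_PROB = 0.10  # probability of skipping a discriminator update

def should_update_discriminator(rng=random):
    """Return False with probability SKIP_PROB, i.e. skip ~10% of updates."""
    return rng.random() >= SKIP_PROB

# Inside a PyTorch-style training loop, one would then do something like:
#   if should_update_discriminator():
#       d_loss.backward(); optim_d.step()   # optim_d built with lr=DISC_LR
#   g_loss.backward(); optim_g.step()       # optim_g built with lr=GEN_LR
```

Both tricks weaken the discriminator slightly, which can help when it otherwise overpowers the generator and training collapses into artifacts.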

DeepLatte commented 1 year ago

Thank you for your interest in our paper. As you said, there is a typo in the paper; the learning rate should have been stated as 2e-4. Imaging artifacts may remain at lower training steps, but they are suppressed as training goes on. I think the generator concentrates on lower-frequency components first; after that, it starts to learn how to suppress such artifacts.

Looking at your solutions, it seems there are cases where the discriminator fails during training. In our case, some of the discriminators failed to learn when the dataset was very small, which caused artifacts in the outputs. We tried to prevent this failure by adjusting the discriminators' parameters. The solutions you suggested also seem like a good way to address the problem. Thanks for the suggestion.

WhiteFu commented 1 year ago

Thanks for your reply, I will try it.