Closed ksachdeva closed 2 years ago
Hi @ksachdeva, for the KL annealing I followed trajectron++'s code to make the comparison with them fair. I think beta is just a weighting parameter, and there is no restriction saying it cannot go beyond 1. The reason to make it as large as 100 is to enforce a closer match between the prior and recognition networks: the KL loss is very small and may not otherwise have enough influence on the total loss.
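For anyone reading along, a minimal sketch of the kind of schedule being discussed: a KL weight that ramps linearly from 0 up to a chosen maximum (here 100, per the discussion above) over a fixed number of training steps. All names and step counts here are illustrative, not taken from the repository's actual code:

```python
# Hypothetical linear KL-weight annealing schedule (illustrative only).
# The weight ramps from 0 to max_weight over anneal_steps training steps,
# then stays at max_weight.

def kl_weight(step: int, anneal_steps: int = 10_000, max_weight: float = 100.0) -> float:
    """Return the KL weight (beta) for the given training step."""
    return min(max_weight, max_weight * step / anneal_steps)

# The per-batch loss would then be weighted roughly as:
#   loss = reconstruction_loss + kl_weight(step) * kl_divergence
```

With max_weight=1.0 this reduces to the standard beta-VAE-style annealing from 0 to 1; setting it higher, as here, simply gives the KL term more influence on the total loss once annealing completes.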
Hi @MoonBlvd
Typically I have seen that in VAE implementations one slowly ramps up the "beta" for the KL loss. It is observed that it is not a good idea to include the KL term at early stages, so you start from beta=0 and then reach beta=1 after some training steps. beta=1 means you are including the full KL loss.
In your implementation, I see different behavior. You use a scheduler for the KL weight that increases it after every batch step, up to a maximum value of 100. Shouldn't it increase up to 1 instead of 100?
I would appreciate it if you could explain the reasoning behind this, and/or point to a paper that discusses this kind of annealing scheme.
Regards & thanks, Kapil