Closed ksachdeva closed 2 years ago
Hi @ksachdeva, for the KL annealing I followed trajectron++'s code to make the comparison with them fair. I think beta is just a weighting parameter, and there is no restriction saying it cannot go beyond 1. The reason to make it as large as 100 is to enforce a closer match between the prior and recognition networks: the KL loss is very small and may not otherwise have enough influence on the total loss.
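For anyone reading along, a minimal sketch of the kind of schedule being discussed: a KL weight that ramps linearly from 0 up to a chosen maximum (here 100, per the discussion above) over a fixed number of training steps. All names and step counts here are illustrative, not taken from the repository's actual code:

```python
# Hypothetical linear KL-weight annealing schedule (illustrative only).
# The weight ramps from 0 to max_weight over anneal_steps training steps,
# then stays at max_weight.

def kl_weight(step: int, anneal_steps: int = 10_000, max_weight: float = 100.0) -> float:
    """Return the KL weight (beta) for the given training step."""
    return min(max_weight, max_weight * step / anneal_steps)

# The per-batch loss would then be weighted roughly as:
#   loss = reconstruction_loss + kl_weight(step) * kl_divergence
```

With max_weight=1.0 this reduces to the standard beta-VAE-style annealing from 0 to 1; setting it higher, as here, simply gives the KL term more influence on the total loss once annealing completes.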
Hi @MoonBlvd
Typically I have seen that in VAE implementations one slowly ramps up the "beta" for the KL loss. It is observed that it is not a good idea to include the KL term at early stages, so you start from beta=0 and then reach beta=1 after some training steps. beta=1 means you are including the full KL loss.
In your implementation, I see different behavior. You use a scheduler for the KL weight that increases it after every batch step, up to a maximum value of 100. Shouldn't it increase up to 1 instead of 100?
I would appreciate it if you could explain the reasoning behind this, and/or point to a paper that discusses this kind of annealing scheme.
Regards & thanks, Kapil