openai / Video-Pre-Training

Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

Rationale for KL decay #44

Open · Rolv-Arild opened this issue 4 months ago

Rolv-Arild commented 4 months ago

Hi, just a small question about the choice of making the KL term decay in importance.

In the paper you describe that high values of the KL divergence coefficient overly constrain the policy, while low values let it forget useful behaviors from the BC model, so you set the coefficient to decay gradually. I'm wondering whether this risks the worst of both worlds, where the coefficient only passes through a good range for a brief period. Did you try setting it to a constant in-between value instead of decaying it?
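For reference, the kind of setup I have in mind looks roughly like the sketch below (PyTorch-style; the linear schedule, the coefficient values, the KL direction, and the function names are my own illustrative assumptions, not the paper's hyperparameters or this repo's actual implementation):

```python
import torch
import torch.nn.functional as F


def kl_coeff(step: int, total_steps: int,
             initial_coeff: float = 1.0, final_coeff: float = 0.0) -> float:
    """Linearly decay the KL coefficient over training.

    The linear shape and the endpoint values are placeholders, not the
    schedule actually used for VPT fine-tuning.
    """
    frac = min(step / total_steps, 1.0)
    return initial_coeff + frac * (final_coeff - initial_coeff)


def kl_regularized_loss(rl_loss: torch.Tensor,
                        policy_logits: torch.Tensor,
                        bc_logits: torch.Tensor,
                        coeff: float) -> torch.Tensor:
    """Add a KL penalty toward the frozen BC prior to an RL loss.

    policy_logits: logits of the policy being fine-tuned, shape (batch, n_actions)
    bc_logits:     logits of the frozen BC model on the same observations
    """
    # Computes KL(pi_BC || pi_theta); I'm not asserting this is the
    # direction used in the paper.
    kl = F.kl_div(
        F.log_softmax(policy_logits, dim=-1),  # input: log-probs of current policy
        F.log_softmax(bc_logits, dim=-1),      # target: log-probs of the frozen BC prior
        log_target=True,
        reduction="batchmean",
    )
    return rl_loss + coeff * kl
```

The constant-coefficient variant I'm asking about would just replace `kl_coeff` with a fixed intermediate scalar for the whole run.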

brandonhoughton commented 4 months ago

The thinking is that constraining the policy strongly at the beginning of training encourages more general refinement later in training. I'm not sure whether the ablation you proposed was run.

Rolv-Arild commented 3 months ago

Did you try any other variations? A lower starting coefficient? A faster decay?