janpawlowskiof closed this issue 1 year ago
@younesbelkada can you reproduce? negative KL is very suspicious!
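(For reference: a true KL divergence is non-negative by Gibbs' inequality, so a persistently negative KL estimate usually points to a bookkeeping bug in how log-probabilities are gathered rather than a property of the models. A quick sketch with made-up distributions:)

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i) >= 0 (Gibbs' inequality),
    # with equality only when p == q elementwise.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # strictly positive
print(kl_divergence([0.3, 0.7], [0.3, 0.7]))  # exactly 0.0
```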
I was able to reproduce, will investigate!
The culprit seems to be b5cce0d13e95c8e21eb9d57177930253d9092a02. This PR introduced batched generation in the t5 example; as can be observed in the wandb log, the KL is negative. I can confirm the KL was always positive before that commit.
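(A hedged illustration of how batched generation could produce this: the numbers below are entirely made up, but if batching introduces padding tokens and their per-token log-prob differences are not masked out of the KL estimate, the per-sequence sum can be dragged negative even when the real tokens give a positive value:)

```python
import numpy as np

# Hypothetical per-token log-probs for one response under the trained model
# and the reference model. The first 4 positions are real tokens; the last
# 2 are padding introduced by batched generation.
logp_model = np.array([-1.2, -0.8, -2.1, -0.5, -9.0, -9.0])
logp_ref   = np.array([-1.5, -1.0, -2.3, -0.9, -0.1, -0.1])
mask       = np.array([1, 1, 1, 1, 0, 0], dtype=bool)  # True = real token

diffs = logp_model - logp_ref
kl_correct = float(diffs[mask].sum())  # pads excluded: positive
kl_buggy   = float(diffs.sum())        # pads included: goes negative
print(kl_correct, kl_buggy)
```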
@younesbelkada Understood the bug. I should have checked more diligently.
No problem at all @GauravVirmani! Don't worry about it, as it can happen to anyone! It is also my fault, as I did not flag that the KL was negative when running the experiment.
I was following the t5-sentiment example in order to run RL training on a custom dataset with a custom metric, and it also showed negative KL. So I looked into this issue and the created pull request, which led me to rerun my experiments the same way as @younesbelkada did in the pull request:
the younesbelkada:fix-t5-neg-kl branch (I aborted this run as soon as I saw the negative-KL warning)

Unfortunately, none of this fixed the issue. My knowledge of PPO is limited, so I cannot contribute much to the discussion of the underlying issue, but I hope this information might be useful. Also, I'd be grateful if you could point out any error I may have made.
Thanks a lot for your effort working on this amazing library!
Hi @chizhikchi Thanks for the heads-up and for your kind words! Sadly, your wandb report seems to be private, so we can't see it. We will definitely investigate. In the meantime, can you double-check the solution proposed by @janpawlowskiof, i.e. try the 0.4.1 release? I would also give it a try without batched generation. Let us know how it goes!
Hi, @younesbelkada , thank you for the suggestions!
I ran the same experiment on the fix-t5-neg-kl branch without batched generation, and it seemed to work better, but the KL still went negative on some batches, as can be seen in the graph. I aborted this experiment because it was giving unpromising results.
Then, I ran the same experiment on the 0.4.1 version. The KL wasn't negative this time, so the problem seems to be related to batched generation.
My model didn't improve much, though, but I think that's more a problem of the reward definition and the complexity of the task.
Hope this information helps! Have a nice day :)
Interesting experiment. Does this code generate results individually, without using batched generation?
```python
response_tensors = ppo_trainer.generate(
    query_tensors,
    return_prompt=False,
    length_sampler=output_length_sampler,
    **generation_kwargs,
)
```
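(To make the unbatched alternative concrete, here is a sketch of generating one query at a time so that no padding tokens enter the rollout. `_StubTrainer` is a hypothetical stand-in for a TRL-style `PPOTrainer`, used only to show the shape of the loop:)

```python
# Sketch of per-query (unbatched) generation. `_StubTrainer` is a made-up
# stand-in for a TRL-style PPOTrainer, not the real API.
class _StubTrainer:
    def generate(self, query, return_prompt=True, **kwargs):
        generated = ["<gen>"]  # pretend-generated continuation tokens
        return [query + generated if return_prompt else generated]

ppo_trainer = _StubTrainer()
query_tensors = [["q1"], ["q2a", "q2b"]]

response_tensors = []
for query in query_tensors:  # one query at a time: no in-batch padding
    response = ppo_trainer.generate(query, return_prompt=False)
    response_tensors.append(response[0])
```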
I just reran the t5-sentiment example, and right now on master it has negative KL divergence and does not learn in general. This does not seem to have happened on the v0.4.1 release.
The t5-sentiment.py script itself does not seem to be the culprit, as I tested master with the script reverted to its v0.4.1 version and the behavior was identical.