Replacing the core RL algorithm might be too far-fetched. Instead of replacing the core RL-based algorithm, I will see if I can apply the naive supervised end-to-end reward maximization described above as a guide to PPO, i.e. as an auxiliary task in the RLHF Trainer. Sorry if this sounds vague. I will try to work on it on weekends. I also need to go over all the files in the repo in detail, as well as the recent literature on incorporating human feedback to improve LLMs, so my approach might be completely wrong.
Yes, you are correct: if a simpler approach worked, OpenAI would likely have tried it already. The approach I suggested is similar to PPLM.
As RL dynamics might be hard to tame for bigger models, maybe we can use PPLM as auxiliary guidance in addition to PPO. I hope this has minimal overhead, since you have already implemented most of the ingredients in the codebase.
We can have a schedule where the PPLM term starts with a weight of, say, 0.5 and decays to 0.0, roughly as sketched below.
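A minimal sketch of what I have in mind, assuming the PPO loss and a PPLM-style guidance loss are already computed somewhere in the training loop (the names `ppo_loss`, `guidance_loss`, and `aux_weight` are placeholders of mine, not anything from the repo):

```python
import torch

def aux_weight(step, total_steps, start=0.5, end=0.0):
    # linearly decay the auxiliary guidance weight from `start` to `end` over training
    progress = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * progress

# stand-ins for losses that would come from the actual trainer
ppo_loss = torch.tensor(1.23, requires_grad=True)       # placeholder for the PPO surrogate loss
guidance_loss = torch.tensor(0.45, requires_grad=True)  # placeholder for a PPLM-style guidance loss

step, total_steps = 100, 10_000
total_loss = ppo_loss + aux_weight(step, total_steps) * guidance_loss
total_loss.backward()
```

The exact decay shape (linear, cosine, etc.) is just a detail; the point is that the guidance term dominates early and fades out so pure PPO takes over.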
For now, I am closing the issue; if I get good results from toy experiments combining Hugging Face transformers with a PPO/PPLM combination, I will post them here.
@lucidrains Thanks for the awesome work.
@ssintelli haha, you read my deleted post
yea, let us know if you get PPLM working
I am also interested in your experiments. Good luck, and let me know if you get good results : )
I didn't try this on language; however, I tried something similar with image segmentation, creating pseudo feedback since I didn't have supervision data. The results were inconclusive, sometimes good and sometimes bad. What I could figure out from my experiments is that we need a good amount of supervision data, then overfit the generative model for a few epochs on the supervised fine-tuning task, and then optionally use RL with pseudo feedback and supervised feedback for alignment and task-specific results. Actually, for my PhD thesis I was initially trying to apply RLHF to improve image segmentation, but after discussion with my supervisor and some crude experiments I paused it for later.
I am still quite confused myself, so I hope I didn't confuse you as well. Hopefully this helps, even if it isn't exactly the answer I intended to give.
At a meta level, PPO-based RLHF performs minor adjustments to the weights to align the model with human feedback.
Can we just replace PPO+RLHF with a preference model, basically a transformer encoder + sigmoid head trained with BCE, and during fine-tuning perform reward maximization by simply training the policy to make the reward model predict 1s?
Sorry if I am being naive. I do not have much experience with either RL or large language models, but I would like to contribute by writing a basic PyTorch pipeline to do the following.
RLHFTrainer already implements large parts of 1 and 2, the reward model is already in place, and PaLM is already there. So I hope that changing the objective to simply maximize preference, by concatenating the input and response, feeding them to the reward model, and training the policy so it predicts all 1s, won't be too difficult; a rough sketch follows below.
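To make the idea concrete, here is a minimal sketch of the objective I mean, with a frozen placeholder reward model. The modules and the pooling are stand-ins of mine, not the repo's actual PaLM policy or RewardModel API:

```python
import torch
import torch.nn.functional as F
from torch import nn

embed_dim = 512
# placeholder reward head: a real reward model would be a full transformer encoder
reward_model = nn.Sequential(nn.Linear(embed_dim, 1))

for p in reward_model.parameters():
    p.requires_grad = False  # freeze the preference model; only the policy receives gradients

def reward_maximization_loss(prompt_hidden, response_hidden):
    """prompt_hidden, response_hidden: (batch, seq, dim) hidden states coming from the policy."""
    # concatenate prompt and generated response along the sequence dimension
    sequence = torch.cat([prompt_hidden, response_hidden], dim=1)
    # crude mean-pooling for the sketch, then score with the frozen reward model
    logits = reward_model(sequence.mean(dim=1)).squeeze(-1)
    # push the predicted preference probability towards 1 with BCE
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

# usage sketch with random policy outputs standing in for real hidden states
prompt_hidden = torch.randn(2, 16, embed_dim, requires_grad=True)
response_hidden = torch.randn(2, 32, embed_dim, requires_grad=True)
loss = reward_maximization_loss(prompt_hidden, response_hidden)
loss.backward()
```

The key design choice is that the reward model stays frozen and acts purely as a differentiable critic, so all the gradient flows back into the policy that produced the response.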
There are issues with backpropagating meaningful gradients end to end, since predictions are beam-searched at inference time, but I hope we can fix that with some tricks, e.g. a straight-through relaxation like the one sketched below.
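One common trick for the discrete-sampling problem (not something already in the repo, just a generic sketch) is a straight-through Gumbel-softmax: the forward pass produces one-hot tokens, while the backward pass uses the soft relaxation so gradients reach the policy logits:

```python
import torch
import torch.nn.functional as F

# (batch, seq, vocab) policy logits; random values as a stand-in for real model outputs
logits = torch.randn(2, 32, 1000, requires_grad=True)

# hard=True returns one-hot samples in the forward pass but backpropagates
# through the soft relaxation (straight-through estimator)
one_hot_tokens = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)

# multiplying the one-hot samples with an embedding matrix gives a differentiable
# "soft" response that could be fed to the reward model
embedding = torch.randn(1000, 512)
soft_response = one_hot_tokens @ embedding  # (batch, seq, dim), differentiable w.r.t. logits

soft_response.sum().backward()
assert logits.grad is not None  # gradients flow back into the policy logits
```

This sidesteps beam search entirely during the reward-maximization step; whether the resulting samples are good enough to train on is exactly the kind of thing the toy experiments would need to show.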