huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

How to liberate the gpt2 from reference model? #22

Closed yananchen1989 closed 1 year ago

yananchen1989 commented 3 years ago

Hi,

We know that the KL divergence is used in the loss as a constraint on how far the active gpt2, which produces the responses for reward feedback, can drift from the original gpt2. How can I tune the parameters to relax this constraint? I want the active gpt2 to be able to deviate substantially from the original reference gpt2, because in my experiments the rewards do not improve as expected, possibly due to this constraint. I am new to PPO and hoping for some suggestions.

Thanks.

lvwerra commented 3 years ago

You could set init_kl_coeff=0 (see here) to liberate the model from the reference completely, or increase the KL target (which is 6 by default).
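
In the dict-style configuration this version of trl uses, that would look roughly like the sketch below; the exact key names (init_kl_coef, adap_kl_ctrl, target) are assumptions, so check the PPOTrainer defaults of the version you have installed:

```python
# Hypothetical config overrides to weaken or remove the KL constraint.
# Key names are assumptions based on the old dict-style PPO hyperparameters;
# verify them against the PPOTrainer defaults in your trl version.
ppo_config = {
    "init_kl_coef": 0.0,    # start with no KL penalty ("liberate" the model)
    "adap_kl_ctrl": False,  # keep the coefficient fixed instead of adapting it
    # or, instead of removing the penalty, tolerate more drift:
    # "target": 20,         # adaptive-KL target; default is 6
}
```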

yananchen1989 commented 3 years ago

You could set init_kl_coeff=0 (see here) to liberate the model from the reference completely, or increase the KL target (which is 6 by default).

Thanks.

yananchen1989 commented 3 years ago

By the way, have you investigated how to tune txt_in_len and txt_out_len to better preserve the topic/sentiment of the generated texts? Currently, I find that fine-tuning GPT2 before using it for generation makes a difference.

lvwerra commented 3 years ago

No, I have not experimented much with these parameters. The main motivation for using input text at all is to force some variation in the generation.

Yes, I suspect one gets the best (or rather the quickest) performance gains by first using supervised training to bring the initial LM distribution as close as possible to the desired target distribution. This also makes the KL constraint better defined, since you measure it against an LM on the same domain.
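
A minimal sketch of such a supervised warm-up step, using the standard transformers Trainer; the corpus file, hyperparameters, and output path are placeholders:

```python
# Hypothetical sketch: supervised fine-tuning of GPT-2 on in-domain text before
# PPO, so the initial policy and the reference model both start close to the
# target distribution. "domain_corpus.txt" is a placeholder corpus.
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw["train"].map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-domain", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("gpt2-domain")  # use this checkpoint as both policy and reference for PPO
```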

yananchen1989 commented 2 years ago

@lvwerra Hi, I recently found that you added a simple code demo here: https://lvwerra.github.io/trl// where ppo_config = {'batch_size': 1, 'forward_batch_size': 1}

I suppose this is single-sample mode, rather than batch mode.

Based on your experience, did you find any difference in performance between single-sample and batch mode? Are there any other cautions when using single-sample mode to update GPT2?

Thanks in advance.

lvwerra commented 2 years ago

Hi @yananchen1989, the simple code demo is just a proof of concept and I never used that config for actual training. I did not run many experiments varying these settings and just stuck to the settings from the original paper.
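
For context, the two regimes look roughly like this in the dict-style configuration; the batch values below are illustrative assumptions, not the exact settings used:

```python
# The single-sample config from the linked proof-of-concept demo:
demo_config = {"batch_size": 1, "forward_batch_size": 1}

# A batch config closer to the paper-style defaults
# (values are assumptions; check the PPOTrainer defaults you run with):
train_config = {"batch_size": 256, "forward_batch_size": 16}
```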

yananchen1989 commented 2 years ago

@lvwerra Thanks. I find it crucial to design a good reward feedback module that can return rewards with both positive and negative values. The reference GPT also needs to be fine-tuned on some related corpus. These two points make it quite impractical.

During my trials, if I do not fine-tune the reference GPT on related texts (because there are no appropriate texts for fine-tuning), or if I only have a reward classifier that gives positive feedback, for example scoring a generated text that does not look much like a politics article as, say, 0.001, and one that does look like politics news as 0.973, then the generated texts deteriorate after several iterations of PPO training, ending up as repetitive snippets or meaningless results, even though I have tuned parameters such as the KL coefficients.

lvwerra commented 2 years ago

I think the fine-tuning is not a necessary step, but it improves stability and convergence. As for the reward function, I don't see the point of a strictly positive reward. What would you try to learn from it?
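
A signed reward can be as simple as mapping the classifier probability through a logit; a minimal sketch, assuming the classifier probabilities described above (e.g. 0.001 / 0.973) as input:

```python
# Hypothetical sketch: turn a classifier probability into a signed reward, so
# off-topic text is actively penalised instead of just getting a small positive
# score. Any text classifier that returns a probability works the same way.
import torch

def signed_reward(p_politics: float) -> torch.Tensor:
    # Map p in (0, 1) to a logit-style score: negative when the text does not
    # look like politics (p < 0.5), positive when it does (p > 0.5).
    p = min(max(p_politics, 1e-6), 1 - 1e-6)
    return torch.log(torch.tensor(p / (1 - p)))

print(signed_reward(0.001))  # strongly negative
print(signed_reward(0.973))  # clearly positive
```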