Errata on "Illustrating Reinforcement Learning from Human Feedback (RLHF)"

Voyz commented 1 year ago

https://huggingface.co/blog/rlhf

Background

In the section on the third step of the process, it is written:

What multiple organizations seem to have gotten to work is fine-tuning some or all of the parameters of a copy of the initial LM with a policy-gradient RL algorithm, Proximal Policy Optimization (PPO).
Parameters of the LM are frozen because fine-tuning an entire 10B or 100B+ parameter model is prohibitively expensive ...
First, the policy is a language model that takes in a prompt and returns a sequence of text (or just probability distributions over text)

I'm confused, because based on this information:

Fine-tuning an entire model is prohibitively expensive.
Yet, the policy for the RL fine-tuning is the language model, which is a copy of the original LM
Fine-tuning of which is prohibitively expensive

I may not understand this fully, but my current logic leads me to think that these statements are contradicting.

I tried to research the RL fine-tuning process a bit and asked ChatGPT to brainstorm with me about this contradiction. From what I've gathered the parameters of the original LM are frozen not because it is prohibitively expensive (as we're doing that anyway with the copy), but because:

It allows the retention and leverage of valuable knowledge and language understanding acquired during pretraining of original LM, while making targeted adjustments to align the copy of the model with specific objectives or human preferences. This would also correlate with what is written later in the article about the KL divergence.
It allows researchers to explore different fine-tuning approaches without modifying the original LM, enabling greater control and experimentation.

Hence, is the following a correct reasoning on freezing during fine-tuning?

We freeze all of the original LM parameters - for these two aforementioned reasons
We freeze some of the copy of the LM parameters - because it's expensive to fine-tune all of them

This way, the statement about 'fine-tuning an entire 10B or 100B+ parameter model is prohibitively expensive' would be coherent with the rest of the information provided.

Suggestions

If that reasoning is correct, then I think this sentence:

fine-tuning some or all of the parameters of a copy of the initial LM

Would probably make more sense if it didn't state 'or all'.

And additionally, the following statement:

Parameters of the LM are frozen because fine-tuning an entire 10B or 100B+ parameter model is prohibitively expensive

Would make more sense if it started with 'Some parameters of the copy of the LM...'.

And potentially I'd suggest adding an explanation regarding the original LM freezing, along the lines of:

Parameters of the original LM are frozen in order to retain the valuable knowledge and language understanding acquired during pretraining, while making targeted adjustments to align the copy of the LM with specific objectives or human preferences

If there's something I'm missing here, please accept my apology for these suggestions and let me know what I got wrong.

U-n-Own commented 1 year ago

Hi i got here just because i had your same doubts, so i went and read my notes from classes where the teacher spoke about RLHF in GPT 3 was done.

So here what i can guess the writer could have been trying to say: Initialize a copy of the model $p_{\theta}^{RL}(s)$ with parameter $\theta$ then you have your fine tuned model and use TRPO (those with KL divergence) or simply PPO that is less computational intensive, in this way you're modifying not all the weights of your network, because this isn't really a finetuning for what i understood.

osanseviero commented 1 year ago

cc @lewtun @natolambert

natolambert commented 1 year ago

There's a lot of things that could be improved in this, let me spin up a PR. It's about time for it to be freshened up :)

natolambert commented 1 year ago

I don't know if I agree with this statement:

Parameters of the original LM are frozen in order to retain the valuable knowledge and language understanding acquired during pretraining, while making targeted adjustments to align the copy of the LM with specific objectives or human preferences.

As I think it is expected that RLHF is retaining this information in all forms. Without that, it would fail (and the KL constraint would show that)

huggingface / blog

Errata on "Illustrating Reinforcement Learning from Human Feedback (RLHF)" #1292

Background

Suggestions