huggingface / blog

Public repo for HF blog posts
https://hf.co/blog

Errata on "Illustrating Reinforcement Learning from Human Feedback (RLHF)" #1292

Closed · Voyz closed 1 year ago

Voyz commented 1 year ago

https://huggingface.co/blog/rlhf

Background

In the section on the third step of the process, it is written:

I'm confused, because based on this information:

I may not understand this fully, but my current logic leads me to think that these statements contradict each other.

I tried to research the RL fine-tuning process a bit and asked ChatGPT to brainstorm with me about this contradiction. From what I've gathered, the parameters of the original LM are frozen not because fine-tuning would be prohibitively expensive (we're fine-tuning the copy anyway), but because:

Hence, is the following correct reasoning about freezing during fine-tuning?

This way, the statement that 'fine-tuning an entire 10B or 100B+ parameter model is prohibitively expensive' would be consistent with the rest of the information provided.
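For what it's worth, here is a minimal sketch of how I picture the setup (the model name, the `transformers`/PyTorch calls, and the choice of which layers to leave trainable are my own illustrative assumptions, not something taken from the post):

```python
import copy
from transformers import AutoModelForCausalLM

# Original (pretrained) LM: kept frozen and used only as a reference, e.g. for the KL penalty.
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")  # "gpt2" is just a stand-in
ref_model.requires_grad_(False)
ref_model.eval()

# Copy of the initial LM: this is the model that actually gets fine-tuned with RL.
policy = copy.deepcopy(ref_model)
policy.requires_grad_(True)

# Optionally freeze most of the copy as well and train only the last block and final
# layer norm, i.e. "fine-tuning some of the parameters of a copy of the initial LM".
for name, param in policy.named_parameters():
    param.requires_grad = name.startswith("transformer.h.11") or "ln_f" in name
```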

Suggestions

If that reasoning is correct, then I think this sentence:

fine-tuning some or all of the parameters of a copy of the initial LM

Would probably make more sense if it didn't state 'or all'.

And additionally, the following statement:

Parameters of the LM are frozen because fine-tuning an entire 10B or 100B+ parameter model is prohibitively expensive

Would make more sense if it started with 'Some parameters of the copy of the LM...'.

And potentially I'd suggest adding an explanation regarding the freezing of the original LM, along the lines of:

Parameters of the original LM are frozen in order to retain the valuable knowledge and language understanding acquired during pretraining, while making targeted adjustments to align the copy of the LM with specific objectives or human preferences

If there's something I'm missing here, please accept my apologies for these suggestions and let me know what I got wrong.

U-n-Own commented 1 year ago

Hi, I got here just because I had the same doubts, so I went and read my notes from the class where the teacher spoke about how RLHF was done in GPT-3.

So here is what I can guess the writer was trying to say: initialize a copy of the model $p_{\theta}^{RL}(s)$ with parameters $\theta$; that copy is your fine-tuned model, and you optimize it with TRPO (the one with the KL-divergence constraint) or simply PPO, which is less computationally intensive. In this way you're not modifying all the weights of your network, because, as far as I understood, this isn't really fine-tuning.
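To make the KL part concrete, here is a minimal sketch of the KL-shaped reward used in PPO-style RLHF; the coefficient `beta`, the reward score, and the log-prob tensors below are illustrative placeholders, not anything from the post or my notes:

```python
import torch

def rlhf_reward(reward_score, policy_logprobs, ref_logprobs, beta=0.02):
    """Combine the preference-model score with a per-token KL penalty.

    reward_score:    scalar score from the reward (preference) model
    policy_logprobs: log-probs of the sampled tokens under the tuned copy
    ref_logprobs:    log-probs of the same tokens under the frozen original LM
    beta:            KL penalty coefficient
    """
    kl_per_token = policy_logprobs - ref_logprobs   # sample-based KL estimate
    return reward_score - beta * kl_per_token.sum() # penalize drifting away from the original LM

# Toy example with made-up numbers
r = rlhf_reward(
    reward_score=torch.tensor(1.3),
    policy_logprobs=torch.tensor([-1.2, -0.7, -2.1]),
    ref_logprobs=torch.tensor([-1.0, -0.9, -2.0]),
)
```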

osanseviero commented 1 year ago

cc @lewtun @natolambert

natolambert commented 1 year ago

There are a lot of things that could be improved in this; let me spin up a PR. It's about time for it to be freshened up :)

natolambert commented 1 year ago

I don't know if I agree with this statement:

Parameters of the original LM are frozen in order to retain the valuable knowledge and language understanding acquired during pretraining, while making targeted adjustments to align the copy of the LM with specific objectives or human preferences.

As I think RLHF is expected to retain this information in all forms. Without that, it would fail (and the KL constraint would show it).
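For concreteness, a minimal sketch of the kind of KL term I mean (my own illustration; the logits are assumed to come from the tuned copy and the frozen original LM on the same batch):

```python
import torch
import torch.nn.functional as F

def kl_to_reference(policy_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token KL(policy || reference) over next-token distributions.

    Large values would indicate the tuned copy has drifted far from the
    distribution the original LM learned during pretraining.
    """
    # Flatten (batch, seq, vocab) -> (batch * seq, vocab) so "batchmean" gives a per-token mean.
    policy_logp = F.log_softmax(policy_logits, dim=-1).flatten(0, -2)
    ref_logp = F.log_softmax(ref_logits, dim=-1).flatten(0, -2)
    # F.kl_div(input, target) computes KL(target || input), so this is KL(policy || reference).
    return F.kl_div(ref_logp, policy_logp, log_target=True, reduction="batchmean")
```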