Closed Voyz closed 1 year ago
Hi, I got here because I had the same doubts, so I went back and read my notes from a class where the teacher discussed how RLHF was done in GPT-3.
So here is what I guess the writer was trying to say: initialize a copy of the model $p_{\theta}^{RL}(s)$ with parameters $\theta$; that copy is your fine-tuned policy, and you optimize it with TRPO (the one with the KL-divergence constraint) or simply PPO, which is less computationally intensive. In this way you are not modifying all the weights of the network, because, as I understand it, this isn't really a full fine-tuning.
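To make the KL part concrete, here is a minimal sketch of how the KL penalty typically enters the per-token reward during PPO-style RLHF. The function name, the `beta` coefficient, and the convention of adding the scalar preference reward only on the final token are my assumptions for illustration, not the blog's exact formulation:

```python
def kl_penalized_reward(reward, logp_rl, logp_ref, beta=0.02):
    """Per-token RLHF reward: the preference-model reward minus a KL
    penalty that keeps the tuned policy close to the frozen reference.

    reward   -- scalar score from the reward model (assumed given)
    logp_rl  -- log-probs of the sampled tokens under the tuned policy
    logp_ref -- log-probs of the same tokens under the frozen reference
    beta     -- KL penalty coefficient (hypothetical value)
    """
    # Single-sample KL estimate per token: log p_RL(token) - log p_ref(token)
    kl = [lp_rl - lp_ref for lp_rl, lp_ref in zip(logp_rl, logp_ref)]
    # Penalize drift on every token; add the scalar reward on the last one.
    penalized = [-beta * k for k in kl]
    penalized[-1] += reward
    return penalized
```

So if the tuned policy drifts toward higher-probability tokens than the reference assigns, each of those tokens is taxed, which is what discourages the policy from moving too far from the pretrained distribution.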
cc @lewtun @natolambert
There are a lot of things that could be improved in this; let me spin up a PR. It's about time for it to be freshened up :)
I don't know if I agree with this statement:
Parameters of the original LM are frozen in order to retain the valuable knowledge and language understanding acquired during pretraining, while making targeted adjustments to align the copy of the LM with specific objectives or human preferences.
I think RLHF is expected to retain this knowledge regardless of which parameters are trainable. Without that, it would fail (and the KL constraint would show it).
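To illustrate what "the KL constraint would show it" means: the KL divergence between the tuned policy's and the reference model's next-token distributions measures exactly how much pretrained knowledge the policy has drifted away from. A minimal sketch (toy distributions, not real model outputs):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two next-token distributions over the same vocab.

    A large value means the tuned policy P has drifted far from the
    frozen pretrained reference Q; the RLHF penalty pushes it back.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

An identical pair of distributions gives 0; the further the policy concentrates mass away from the reference, the larger the divergence grows, so a run that "forgot" its pretraining would be visible as an exploding KL term.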
https://huggingface.co/blog/rlhf
Background
In the section on the third step of the process, it is written:
I'm confused, because based on this information:
I may not understand this fully, but my current logic leads me to think that these statements contradict each other.
I tried to research the RL fine-tuning process a bit and asked ChatGPT to brainstorm with me about this contradiction. From what I've gathered, the parameters of the original LM are frozen not because fine-tuning is prohibitively expensive (we're doing that anyway with the copy), but because:
Hence, is the following correct reasoning about freezing during fine-tuning?
This way, the statement about 'fine-tuning an entire 10B or 100B+ parameter model is prohibitively expensive' would be coherent with the rest of the information provided.
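If only some parameters of the copy are trainable (the blog mentions freezing most layers), the selection logic can be sketched as below. This is purely illustrative: the parameter-naming scheme (`blocks.N.…`, `lm_head.…`), the block count, and the choice to unfreeze only the top blocks plus the head are my assumptions, not something the blog specifies:

```python
def trainable_param_names(all_names, n_unfrozen_blocks=2, n_blocks=12):
    """Pick which parameters stay trainable when only the top
    transformer blocks of the policy copy are updated during PPO
    (a common memory-saving choice; layer names are hypothetical).
    """
    # Trailing dot avoids prefix collisions (e.g. "blocks.1." vs "blocks.11.").
    unfrozen = {f"blocks.{i}." for i in range(n_blocks - n_unfrozen_blocks, n_blocks)}
    keep = []
    for name in all_names:
        if name.startswith("lm_head.") or any(name.startswith(p) for p in unfrozen):
            keep.append(name)
    return keep
```

Everything not returned by this function would have its gradients disabled, which is what makes the PPO update affordable while the bulk of the pretrained weights stays untouched.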
Suggestions
If that reasoning is correct, then I think this sentence:
Would probably make more sense if it didn't state 'or all'.
And additionally, the following statement:
Would make more sense if it started with 'Some parameters of the copy of the LM...'.
And potentially I'd suggest adding an explanation regarding the original LM freezing, along the lines of:
If I'm missing something here, please forgive these suggestions and let me know what I got wrong.