huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Always allow `ref_model=None` #2047

Open qgallouedec opened 2 months ago

qgallouedec commented 2 months ago

Feature request

For optimization with a reference model, in most cases the reference model is simply a copy of the trained model. We should require the user to specify a ref model only when they don't want to use the trained model as the reference.

Currently this is possible, but only when using PEFT, which is very counter-intuitive. And even in that situation, if you want to provide a ref model that is different from the trained model, you have to set force_use_ref_model. Even more counter-intuitive.

Currently

  1. model == ref_model, no PEFT:

     DPOTrainer(model=model, ref_model=ref_model)  # where ref_model must be another instance

  2. model == ref_model, with PEFT:

     DPOTrainer(model=model)

  3. model != ref_model, no PEFT:

     DPOTrainer(model=model, ref_model=ref_model)

  4. model != ref_model, with PEFT:

     args = DPOConfig(force_use_ref_model=True)
     DPOTrainer(model=model, ref_model=ref_model, args=args)
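The four cases above can be summarized as a dispatch on (ref_model, PEFT, force_use_ref_model). Here is a minimal, hypothetical sketch of that logic as described in this issue; the function name and structure are illustrative only and do not reproduce the actual trl source:

```python
# Hypothetical sketch of the current ref_model resolution described above.
# `resolve_ref_model` is an illustrative name, not part of the trl API.

def resolve_ref_model(model, ref_model=None, is_peft=False,
                      force_use_ref_model=False):
    """Return the reference model the trainer would use, or raise."""
    if is_peft:
        if ref_model is not None and not force_use_ref_model:
            # With PEFT, the trained model (adapters disabled) normally
            # serves as the reference, so an explicit ref_model is
            # rejected unless force_use_ref_model is set.
            raise ValueError(
                "Passing ref_model with a PEFT model requires "
                "force_use_ref_model=True in DPOConfig."
            )
        # Case 2: reuse the trained model; case 4: use the given one.
        return ref_model if ref_model is not None else model
    if ref_model is None:
        # Cases 1 and 3: without PEFT, a separate instance is required.
        raise ValueError(
            "Without PEFT you must pass a separate ref_model instance."
        )
    return ref_model
```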

Proposed

  1. model == ref_model:

     DPOTrainer(model=model)

  2. model != ref_model:

     DPOTrainer(model=model, ref_model=ref_model)
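Under the proposal, the resolution collapses to a single rule: build the reference from the trained model unless one is given explicitly. A minimal sketch of that rule (illustrative names; the real trainer would build a frozen copy, e.g. via trl's create_reference_model, rather than a plain deepcopy):

```python
import copy

# Hypothetical sketch of the proposed behavior: omitting ref_model makes
# the trainer derive the reference from the trained model itself.

def resolve_ref_model_proposed(model, ref_model=None):
    if ref_model is None:
        # Default: the reference starts as a copy of the trained model.
        return copy.deepcopy(model)
    # Explicit ref_model: use it as-is.
    return ref_model
```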

Motivation

Make the library more intuitive to use.

Your contribution

For sure ;)

RylanSchaeffer commented 2 months ago

Quoting from the other issue:

However, handling ref_model/model is pretty tricky currently, maybe wait until https://github.com/huggingface/trl/issues/2047 is solved?

Is there an explanation for why ref_model and model are tricky? If I were to work on this, should I be wary of any challenges that might pop up?

qgallouedec commented 2 months ago

I believe this may be due to the implementation being carried out in multiple stages: first the initial version, followed by PEFT support, then integration with DeepSpeed... It's probably a good time to re-think it as a whole. However, we must be careful not to introduce any regressions or breaking changes: we must test all the parameter combinations.

RylanSchaeffer commented 2 months ago

In that case, I think it makes sense to just fix the other issue first because the fix for that issue is an equality check, right?

qgallouedec commented 2 months ago

In that case, I think it makes sense to just fix the other issue first because the fix for that issue is an equality check, right?

Perhaps you should give it a try. It's difficult to assess the changes involved.
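The equality check mentioned above could be as small as guarding against the same object being passed twice. A minimal, hypothetical sketch (the function name is illustrative, not trl code):

```python
# Hypothetical guard for the "equality check" fix discussed above:
# reject the same instance being passed as both model and ref_model.

def check_distinct(model, ref_model):
    if ref_model is not None and ref_model is model:
        raise ValueError(
            "model and ref_model must be different instances; "
            "omit ref_model to reuse the trained model as the reference."
        )
```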

qgallouedec commented 1 month ago

Implemented for Online DPO in #2041; it can probably be taken as a reference.