huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

training a multi-adapter model in an interactive way #1377

Closed louieworth closed 7 months ago

louieworth commented 7 months ago
torch == 2.1.0
transformers == 4.35.0
peft == 0.7.1

Based on https://huggingface.co/docs/transformers/v4.36.1/en/peft, I used to be able to train a multi-adapter model interactively, where the two adapters are coupled (similar to actor-critic in reinforcement learning):

training.py

peft_config = LoraConfig(...)
model.add_adapter(peft_config, adapter_name='lora_1')
model.add_adapter(peft_config, adapter_name='lora_2')

my_trainer = MyTrainer(
    model=model,
    args=training_args,
    adapters=script_args.adapters,
    beta=script_args.beta,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    max_length=script_args.max_length,
    max_target_length=script_args.max_target_length,
    max_prompt_length=script_args.max_prompt_length,
)

last_checkpoint = get_last_checkpoint(script_args.output_dir)
my_trainer.train(resume_from_checkpoint=last_checkpoint)

The loss function couples lora_1 and lora_2, so I overrode the loss computation in MyTrainer:

class MyTrainer(Trainer):
  def __init__(self, model, ...):
    ...

  def compute_loss(self, model, inputs, return_outputs=False):
    # forward through lora_1; set_adapter("lora_1") also sets requires_grad=False on lora_2
    self.model.set_adapter("lora_1")
    lora_1_output = self.model(inputs, ...)

    # forward through lora_2, feeding it lora_1's output; lora_1 now has requires_grad=False
    self.model.set_adapter("lora_2")
    loss = self.model(lora_1_output, inputs, ...)

    # finally, leave only the adapter that should be updated on this step active
    if train_lora_1:
        self.model.set_adapter("lora_1")
    else:
        self.model.set_adapter("lora_2")

    return loss

However, I then hit the following error:

assert (
            len(optimizer_state["found_inf_per_device"]) > 0
        ), "No inf checks were recorded for this optimizer."
Exception has occurred: AssertionError
No inf checks were recorded for this optimizer.
  File "/home/lijiang/miniconda3/envs/trl/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py", line 412, in step
    assert (
  File "/home/lijiang/miniconda3/envs/trl/lib/python3.11/site-packages/accelerate/optimizer.py", line 132, in step
    self.scaler.step(self.optimizer, closure)
  File "/home/lijiang/miniconda3/envs/trl/lib/python3.11/site-packages/transformers/trainer.py", line 1911, in _inner_training_loop
    self.optimizer.step()
  File "/home/lijiang/miniconda3/envs/trl/lib/python3.11/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/lijiang/trl/examples/summarization/my_training.py", line 580, in <module>
    my_trainer.train(resume_from_checkpoint=last_checkpoint)
AssertionError: No inf checks were recorded for this optimizer.

I followed the suggestion from @BenjaminBossan in #1303 to set requires_grad=True for all LoRA layers with:

last_checkpoint = get_last_checkpoint(script_args.output_dir)
# [mod.requires_grad_(True) for n, mod in my_trainer.model.named_modules() if "lora" in n]
model.set_adapter(['lora_1', 'lora_2'])
my_trainer.train(resume_from_checkpoint=last_checkpoint)

This runs smoothly without the optimizer error, but I think it then trains both the lora_1 and lora_2 adapters. I can also activate only lora_2, and there is again no optimizer error. However, neither of these options seems to let me update only lora_1.
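As a quick way to verify which adapter is actually trainable at a given point, something like the following could be used (the substring match relies on the adapter name appearing in the PEFT parameter names, which is an assumption worth double-checking):

# Count trainable parameters per adapter to see which one(s) the optimizer would update.
for adapter in ("lora_1", "lora_2"):
    n_trainable = sum(
        p.numel()
        for name, p in my_trainer.model.named_parameters()
        if adapter in name and p.requires_grad
    )
    print(adapter, "trainable params:", n_trainable)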

The end of compute_loss still switches the active adapter as before:

    if train_lora_1:
        self.model.set_adapter("lora_1")
    else:
        self.model.set_adapter("lora_2")

    return loss

Summary of problem:

  1. If I only call self.model.set_adapter("lora_2") at the end of compute_loss, the whole training process runs without errors (all LoRA layers are in the optimizer state, but lora_1.requires_grad=False and lora_2.requires_grad=True).
  2. However, when I call self.model.set_adapter("lora_1") instead, it raises the error above (all LoRA layers are in the optimizer state, but lora_1.requires_grad=True and lora_2.requires_grad=False, which is what I need in order to train only lora_1); see the inspection sketch after this list.
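A possible way to see the mismatch behind case 2 is to compare the parameters the Trainer registered with its optimizer against the parameters that currently require grad. create_optimizer() is normally called inside train(); calling it up front here is purely for inspection:

# Compare optimizer-registered params with params that currently require grad.
my_trainer.create_optimizer()  # Trainer normally does this at the start of train()
opt_params = {id(p) for group in my_trainer.optimizer.param_groups for p in group["params"]}
for name, p in my_trainer.model.named_parameters():
    if "lora" in name:
        print(name, "in_optimizer:", id(p) in opt_params, "requires_grad:", p.requires_grad)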

Who can help?

@BenjaminBossan @pacman100 @younesbelkada

Reproduction

I can provide a code sample if necessary.

Expected behavior

The optimizer step completes successfully.

BenjaminBossan commented 7 months ago

Hi, thanks for bringing this up. I'm not sure if I 100% understand every step you did, but could you try the following:

Since the Trainer registers the parameters for the optimizer only once, at the start, we need to ensure that all parameters that require grads (i.e. the adapter weights for lora_1 and lora_2) are active when we first initialize the Trainer. So at the very beginning, call model.set_adapter(['lora_1', 'lora_2']) before initializing the Trainer. Then, each time you want to train only one of the two adapters, activate it using model.set_adapter(['lora_1']) or model.set_adapter(['lora_2']). This should ensure that only the active adapter gets updated. Did you try this?
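A minimal sketch of that ordering, reusing the names from the snippets above (MyTrainer and training_args come from your code; the exact constructor arguments are elided):

peft_config = LoraConfig(...)
model.add_adapter(peft_config, adapter_name='lora_1')
model.add_adapter(peft_config, adapter_name='lora_2')

# 1. Activate BOTH adapters before the Trainer builds its optimizer, so that
#    both sets of LoRA weights have requires_grad=True and get registered.
model.set_adapter(['lora_1', 'lora_2'])

my_trainer = MyTrainer(model=model, args=training_args, ...)

# 2. Afterwards (e.g. inside compute_loss), activate only the adapter that
#    should be updated on this step:
model.set_adapter(['lora_1'])   # or model.set_adapter(['lora_2'])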

louieworth commented 7 months ago

Hi, thanks for your prompt reply. I have made all the adapter parameters visible to the Trainer with:

[mod.requires_grad_(True) for n, mod in my_trainer.model.named_modules() if "lora" in n]
model.set_adapter(['lora_1', 'lora_2'])

However, I think the tricky thing is that the two adapters are trained in a coupled way, i.e., the output of lora_1 is the input of lora_2. As far as I understand the source code, set_adapter sets requires_grad=True on the associated adapter and requires_grad=False on all others.

So I cannot do this: whenever I call set_adapter('lora_2'), it sets requires_grad=False on the lora_1 weights, which may cause problems. I also tried to explicitly set requires_grad=True again, but that also raises errors.
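For reference, the explicit requires_grad workaround I tried looks roughly like this (the substring match assumes the adapter name appears in the parameter names); it still runs into errors, so it is only an illustration of the coupling, not a fix:

# After activating lora_2, manually turn gradients back on for lora_1's weights
# so the optimizer still sees them as trainable (attempted workaround, not a fix).
model.set_adapter('lora_2')                 # lora_2 active; lora_1 gets requires_grad=False
for name, param in model.named_parameters():
    if 'lora_1' in name:
        param.requires_grad_(True)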


I think the best solution is to load two separate models rather than two adapters on one model, to avoid these non-trivial problems. I will close this issue.