huggingface / transformers


Training multiple adapters #32084

Closed · maxime-louis closed this issue 2 months ago

maxime-louis commented 3 months ago

System Info

Who can help?

Hello!

I'm trying to simultaneously train some lora adapters on a model.

I use the following kind of syntax. At model initialization:

    model.add_adapter(peft_config, 'adapter_1')
    model.add_adapter(peft_config, 'adapter_2')

in my model forward:

    model.set_adapter('adapter_1')
    x1 = model(inputs)
    model.set_adapter('adapter_2')
    x2 = model(inputs)
    logits = get_logits(x1, x2)  # a function of both, e.g. x1 + x2
    return {'loss': loss(logits, label), "logits": logits}

Unlike most of the examples I found, I don't want to train the adapters 'separately' (e.g. for different tasks); I want to train them at the same time, in the same trainer and on the same dataset, using the two outputs to optimize a global loss.

I noticed that a call to .parameters() does not return all parameters (only those of the active adapter), so I modified it to gather the parameters from both adapters (I checked: in the end I do get all parameters from both adapters, and all of them have requires_grad=True). I used that modification to declare an optimizer, which I provided to the trainer. In principle, training should then update the weights of both adapters, but sadly only one of the adapters is modified during training.
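
For illustration, what I tried is roughly along these lines (a simplified sketch, not my exact code; training_args and train_dataset are placeholders, and the optimizer/learning rate are arbitrary):

    import torch
    from transformers import Trainer

    # LoRA parameter names include the adapter name (e.g. "...lora_A.adapter_1.weight"),
    # so gather the parameters of both adapters explicitly.
    adapter_params = [
        p for n, p in model.named_parameters()
        if 'adapter_1' in n or 'adapter_2' in n
    ]
    optimizer = torch.optim.AdamW(adapter_params, lr=1e-4)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        optimizers=(optimizer, None),  # (optimizer, lr_scheduler)
    )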

NB: I'm not interested in activating both adapters at the same time at any point

How should I proceed?

Thanks :)

@muellerzr @ArthurZucker

Information

Tasks

Reproduction

Not provided at this point.

Expected behavior

I would expect to be able to train multiple adapters on the same model at once. Maybe there is a way that I did not find in the documentation.

maxime-louis commented 3 months ago

OK, so after some tests, it seems that adding model.set_adapter(['adapter_1', 'adapter_2']) at the end of the forward method does allow both adapters to be updated. Can you confirm this is the expected behaviour?
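
Concretely, the forward from my first message becomes something like this (sketch):

    model.set_adapter('adapter_1')
    x1 = model(inputs)
    model.set_adapter('adapter_2')
    x2 = model(inputs)
    logits = get_logits(x1, x2)
    model.set_adapter(['adapter_1', 'adapter_2'])  # added: re-activate both adapters at the end
    return {'loss': loss(logits, label), "logits": logits}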

ArthurZucker commented 2 months ago

I am not sure as this is peft specific, cc @BenjaminBossan !

BenjaminBossan commented 2 months ago

As you correctly observed, @maxime-louis, it is crucial to ensure that requires_grad is enabled for both adapters. For context, this is because by default, Trainer from transformers (which I assume you're using) only passes the parameters with requires_grad=True to the optimizer. When creating two adapters, only one has requires_grad=True so the other adapter won't get any updates.
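
For illustration, here is a quick way to check which parameters the default optimizer would see (just a sketch, assuming the adapter name appears in the LoRA parameter names, e.g. "...lora_A.adapter_1.weight"):

    # Only parameters with requires_grad=True are handed to the Trainer's default optimizer.
    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    # Right after adding both adapters, only one of these prints True:
    print(any('adapter_1' in n for n in trainable))
    print(any('adapter_2' in n for n in trainable))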

I noticed that a call to .parameters() does not return all parameters (only those of the active adapter), so I modified it to gather the parameters from both adapters (I checked: in the end I do get all parameters from both adapters, and all of them have requires_grad=True). I used that modification to declare an optimizer, which I provided to the trainer.

I think you were on the right track, but it's not quite clear to me what you did, so I can't tell why this approach did not work. Generally, however, if you call model.set_adapter(['adapter_1', 'adapter_2']) before passing the model to the Trainer, both adapters should have requires_grad=True and hence the optimizer should be initialized correctly.
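
In code, I mean something like this (just a sketch; training_args and train_dataset stand in for your own setup):

    model.add_adapter(peft_config, 'adapter_1')
    model.add_adapter(peft_config, 'adapter_2')
    model.set_adapter(['adapter_1', 'adapter_2'])  # both adapters now have requires_grad=True

    # The Trainer's default optimizer will then include the parameters of both adapters.
    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
    trainer.train()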

it seems that adding model.set_adapter(['adapter_1', 'adapter_2']) at the end of the forward method does allow both adapters to be updated

Normally, I don't think this should be necessary as long as it was ensured earlier that both adapters had requires_grad=True -- maybe this is related to the part earlier that I did not understand. If this workaround works for you, I think you can stick with it, but you could also check whether my suggestion above is sufficient.

maxime-louis commented 2 months ago

Thank you @BenjaminBossan, @ArthurZucker

Activating both adapters before giving the model to the trainer seems like the way to go and works well. (I haven't tried removing the activation within the forward yet; I'm not sure it's a costly operation anyway.) Maybe this could be part of the (very short!) documentation on adapters :)

Thank you !

BenjaminBossan commented 2 months ago

Glad that it works now.

I agree, maybe a sentence or two could be added here: https://huggingface.co/docs/transformers/v4.43.4/en/peft#train-a-peft-adapter. But it is very much an edge case, as it requires training multiple adapters at the same time and also using Trainer.

maxime-louis commented 2 months ago

OK, thank you for your help, it's clearer now. More documentation is always welcome!