modules_to_save not working for AutoModelForSequenceClassification

System Info

Hi, I am using LLAMA2 and GPT2 for sequence classification. Both models add a "score" layer on top to transform the last embedding of the tokens into a vector of class logits. If I specify the layers as below for GPT2 or LLAMA2, I see NaN for the train and validation accuracy. If I use "classifier", which is not the name of a layer, everything "works" in the sense that I get a loss and the accuracy improves, but as I understand it the classifier head is then just random numbers, so all other parameters are changing to try to circumvent this layer.

Also, we can put "score" in "target_modules" and this works, but this should not be what we do if I understand right. This layer has no information of value in it, so it should be properly fine tuned.

Any ideas on what is wrong?

config = LoraConfig(
    r=16,
    lora_alpha=16,
    # These are the LLAMA2 layers
    #target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    # These are the GPT2 layers
    target_modules=["c_attn", "c_proj", "c_fc", "c_proj"],
    # We can put "score" in here and then this layer is fine tuned with LORA, but it should be fully fine tuned since the values in this layer are all random
    #target_modules=["c_attn", "c_proj", "c_fc", "c_proj", "score"]
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["score"], # This gives NaN loss
    # The below works, but this "classifier" layer is not in the model so effectively nothing happens
    # modules_to_save=["classifier"]
)

Who can help?

No response

Information

[ ] The official example scripts
[ ] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder
[ ] My own task or dataset (give details below)

Reproduction

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import torch
from torch import nn
import numpy as np
import re
from transformers import LlamaTokenizer, LlamaForCausalLM
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, LlamaForSequenceClassification
from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_int8_training,
)
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader, random_split
from torchmetrics import Accuracy

model_name = 'TinyPixel/Llama-2-7B-bf16-sharded'
num_labels = 3 

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_labels, torch_dtype=torch.float16, device_map='auto')

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=False,
    trust_remote_code=True,
    device_map='auto'
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left" # Fix weird overflow issue with fp16 training
model.resize_token_embeddings(len(tokenizer)) # https://github.com/huggingface/transformers/issues/1805

# Pass this down into whatever training loop you have
# modules_to_save should be called "score" but this produces NaN loss
# If we use any other string, optimization works but it does not make sense to me
config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["score"],
)
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()  # prints the summary itself and returns None

Expected behavior

The above should fine tune the modules_to_save, not give NaN loss.
If the modules_to_save is NOT in the model, maybe this should crash? You can specify it as "random layer" and this will all work but it does nothing to this layer since it is not in the model.
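
Until such a check exists, a name that matches nothing can be caught by hand before building the config. The helper below is only a sketch: `unmatched_module_names` and `TinyClassifier` are hypothetical names, and the suffix-matching rule is an assumption for illustration, not necessarily the rule PEFT applies internally.

```python
from typing import List

import torch.nn as nn


def unmatched_module_names(model: nn.Module, requested: List[str]) -> List[str]:
    """Return the requested names that match no submodule of `model`.

    Assumption: a name matches when it equals a submodule's dotted path or
    the trailing part of it, e.g. "score" matches "transformer.score".
    """
    module_names = [name for name, _ in model.named_modules() if name]
    return [
        wanted
        for wanted in requested
        if not any(n == wanted or n.endswith("." + wanted) for n in module_names)
    ]


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a sequence-classification model."""

    def __init__(self) -> None:
        super().__init__()
        self.body = nn.Linear(8, 8)
        self.score = nn.Linear(8, 3)  # classification head


model = TinyClassifier()
print(unmatched_module_names(model, ["score"]))       # []
print(unmatched_module_names(model, ["classifier"]))  # ['classifier']
```

A non-empty result means the entry would have no effect on the model, which is the silent "classifier" case described above.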

You are correct in your statement that if the classification head is called "score", it should be added as such to modules_to_save. If you print the PEFT model, if everything works as expected, you should see that the classification head was replaced by PEFT with a ModulesToSaveWrapper.

Now regarding the question why you see NaNs, this is hard to answer. It could be that the training parameters are not well chosen. Just as an example, if the learning rate is too high, we could see NaNs in the output. When you exclude "score" from modules_to_save, we don't have any fully fine-tuned layers, so the same learning rate may not produce NaNs. Did you check all the usual settings that can help with stabilizing training?

Thank you for the comment, really appreciate it. I tried LR = 0.0000000001 etc and this still does not work. Still get NaN. I then tried to go into the gears manually and have some answers/questions for you below:

Are we supposed to use the lora_model that's returned from lora_model = get_peft_model(model, config) or model? Both seem to change. For example here are what the model looks like for GPT2 before and after.

Original model:

model after; notice that 'score' is wrapped:

lora_model; this is a PEFT model and 'score' is wrapped:

The config and the call we use to get the model; the target layers are the Conv1D or Linear layers inside of the model:

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["c_attn", "c_proj", "c_fc", "c_proj"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["score"],
)
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()

I guess I can LORA fine tune 'score' by putting it in target_modules right? But it is not really fine tuned.
I tried to go into my loop and literally get p.grad for each parameter. I then wanted to manually do p = p - LR*p.grad to control the LR per layer even. When I do this, for the 'score' layer, I see None for the gradient. I get the print out below.

Here I am printing "name p.shape p.requires_grad p.grad" ... You can see that score.modules_to_save.default.weight has None as its grad. If I use optimizer.step() directly I get NaN loss. Should I skip this layer in my manual update? What is the difference between original_module and modules_to_save?

score.original_module.weight torch.Size([3, 768]) True tensor([[-0.2822, -0.3643,  0.5391,  ...,  0.5933, -0.6211,  0.1171],
        [-0.1940,  0.0462,  0.0337,  ..., -0.0362, -0.1427, -0.0820],
        [ 0.4763,  0.3179, -0.5723,  ..., -0.5571,  0.7637, -0.0352]],
       device='cuda:0', dtype=torch.float16)

score.modules_to_save.default.weight torch.Size([3, 768]) True None
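
For the manual update described above, one possible shape of the loop is sketched below on a toy model. Two points it illustrates: parameters whose grad is None are skipped, and the update is done in place under `torch.no_grad()`, because `p = p - LR*p.grad` only rebinds the local name and leaves the model's parameter untouched. The toy layers and the per-layer learning rates are assumptions chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 3))
unused = nn.Linear(4, 3)  # never called in forward, so its grads stay None

params = dict(model.named_parameters())
params.update({f"unused.{n}": p for n, p in unused.named_parameters()})


def lr_for(name: str) -> float:
    """Hypothetical per-layer learning rate: larger for the last layer."""
    return 1e-2 if name.startswith("2.") else 1e-3


x = torch.randn(5, 4)
y = torch.tensor([0, 1, 2, 0, 1])
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

before = {n: p.detach().clone() for n, p in params.items()}
with torch.no_grad():
    for name, p in params.items():
        if p.grad is None:
            # No gradient reached this parameter; skip it.
            continue
        p -= lr_for(name) * p.grad  # in-place update of the real parameter
        p.grad = None               # reset for the next step

print(torch.equal(before["unused.weight"], params["unused.weight"]))  # True
print(torch.equal(before["2.bias"], params["2.bias"]))                # False
```

A parameter with requires_grad = True and grad None after backward() simply did not take part in the forward pass, so skipping it changes nothing about the optimization.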

Thank you for investigating further. First of all, you should use the return value from get_peft_model. It is true that the model that you pass is getting modified as well, but the PeftModel you get back is the one to use (it has many helpful methods added on top).

The model reprs that you show all look quite correct, so I don't think there is an issue with your LoRA config. In theory, it would be possible to tune the score layer by adding it as a LoRA target instead of a modules_to_save, but this is usually a bad idea: This layer is initialized totally randomly, so changing it "a little bit", as would be the case with LoRA, is unlikely to lead to success.

Regarding the score layer, it is necessary to understand a little bit the implementation. When you add a layer to modules_to_save, we basically create a copy of it and use the copy instead of the original weights. The copy gets updated, the original weights stay the same. This allows users to later switch back and forth between original model and fine-tuned model.
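
The copy-and-swap idea can be sketched in plain PyTorch. `SavedModuleSketch` below is a hypothetical, simplified stand-in written for this explanation, not PEFT's actual ModulesToSaveWrapper, and the attribute `trainable_copy` is an invented name.

```python
import copy

import torch
from torch import nn


class SavedModuleSketch(nn.Module):
    """Illustrative stand-in: keeps a frozen original and a trainable copy."""

    def __init__(self, module: nn.Module) -> None:
        super().__init__()
        self.original_module = module
        self.trainable_copy = copy.deepcopy(module)
        self.original_module.requires_grad_(False)  # stays as loaded
        self.trainable_copy.requires_grad_(True)    # receives the updates
        self.use_copy = True                        # switch back and forth

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        active = self.trainable_copy if self.use_copy else self.original_module
        return active(x)


torch.manual_seed(0)
head = SavedModuleSketch(nn.Linear(8, 3))
logits = head(torch.randn(4, 8))
logits.sum().backward()

print(head.original_module.weight.grad)             # None
print(head.trainable_copy.weight.grad is not None)  # True
```

In this sketch only the module that the forward pass actually goes through ends up with a gradient, so inspecting which of the two weights has grad None is a way to see which one was used.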

What's strange in your final output is that apparently, the original weights have requires_grad = True and a gradient (if I read it correctly). Normally, this shouldn't happen, they should not be updated. Could you please check that everything is set correctly right after you created the PEFT model? It should be something like this:

lora_model = get_peft_model(model, config)
lora_model.base_model.model.score.original_module.requires_grad  # should be False
lora_model.base_model.model.score.modules_to_save["default"].weight.requires_grad  # should be True
lora_model.base_model.model.score.active_adapter  # should be 'default'
torch.allclose(
    lora_model.base_model.model.score.modules_to_save["default"].weight, 
    lora_model.base_model.model.score.original_module.weight
)  # should be True

Hey! Thank you again. Actually your 3 lines above crash but I'm sure this is working. When I add the module to save as "score" and use print trainable parameters I get more parameters than if I use a random name like "classifier" - so the head is being used as I want.

One question I have: do you HAVE to use the PEFT model? I.e. would the original give wrong answers?

I'll explain step by step with pictures the other comments you made.

lora_model.base_model.model.score.original_module.requires_grad  # should be False
lora_model.base_model.model.score.modules_to_save["default"].weight.requires_grad  # should be True
lora_model.base_model.model.score.active_adapter  # should be 'default'

The first command above crashes with the error below. The object does have requires_grad_, but I am not sure that's what you want since it's a method. I think you mean the "weight" matrix, though, and for that it is True (!). The other two items are True and 'default', as you say.

Possibly related to this, I was trying to fine tune LLAMA2 and GPT2 to see the lift when we go from 120M to 7B parameters (and use LORA for LLAMA2 since we can't fine tune it directly). When I was doing plain fine tuning with GPT2 I was getting NaN loss, but before (in some older work) I was not. I looked at the notebook carefully and found that I was specifying torch_dtype=torch.float16. When I removed this, I no longer got NaN loss in my runs. I have not tried LORA + LLAMA2 with this change yet, but it might be related; I'll circle back. I also pasted the versions of things I am running. Might this be a problem?

Thanks for checking. Indeed, my code was missing the .weight attribute, as you correctly guessed. What is strange to me is that the original module has requires_grad = True. Is that straight after you created the model using get_peft_model? Could you please disable gradients on that module?

lora_model.base_model.model.score.original_module.requires_grad_(False)

I looked at the notebook carefully and found that I was specifying torch_dtype=torch.float16. When I removed this, I no longer got NaN loss in my runs.

Yes, fp16 can more easily result in numerical instabilities. What are you using for training the model? Transformers Trainer or some other framework or custom code? Could you share it?
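
As a small illustration of how fp16 can produce NaN, the sketch below shows the limited range of float16: a value above its maximum becomes inf on conversion, and arithmetic on inf values can then yield NaN. The number 70000.0 is an arbitrary example value, not taken from the models in this thread.

```python
import torch

# Largest finite value representable in float16.
fp16_max = torch.finfo(torch.float16).max
print(fp16_max)  # 65504.0

# A value that is fine in float32 overflows when cast to float16.
big = torch.tensor([70000.0])
as_half = big.to(torch.float16)
print(torch.isinf(as_half).item())  # True

# Once an inf appears, operations such as inf - inf produce NaN.
diff = as_half - as_half
print(torch.isnan(diff).item())  # True
```

Loading the model without torch_dtype=torch.float16 keeps the weights in float32, which has a far larger range and avoids this particular failure mode.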

I also pasted the version of things I am running. Might this be a problem?

Newer versions are generally better because there might have been bug fixes in-between. Could you please try the latest PEFT version?

do you HAVE to use the PEFT model? I.e. would the original give wrong answers?

You could use the original model (which is modified by PEFT) and it should work, you will just miss out on certain features, like saving the adapter weights or merging them into the original weights.

Hey yes that was straight after. I.e. this is the code I am running.

model_name = 'gpt2'
num_labels = 3 

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    device_map='cuda'
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=False,
    trust_remote_code=True,
    device_map='cuda'
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model.resize_token_embeddings(len(tokenizer)) # https://github.com/huggingface/transformers/issues/1805
if model_name == 'gpt2':
    model.config.pad_token_id = model.config.eos_token_id

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["c_attn", "c_proj", "c_fc", "c_proj"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["score"],
)
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()

lora_model.base_model.model.score.original_module.requires_grad_(False)

My custom trainer code is below; I think it's pretty standard ... I wrote this to debug all the above (lol), but now maybe we can just use Trainer. Trainer is very hard for me to debug, though, since it abstracts so much away.

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Move the model to the selected device; define the loss unconditionally
# so that the loop also runs on CPU.
model = model.to(device)
loss_fn = nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
accuracy_metric = Accuracy(task="multiclass", num_classes=3).to(device)
epochs = 10
save_directory = "./gpt2_checkpoints" # Define the directory where you want to save your models
for epoch in range(epochs):
    # Training
    print(f"Epoch: {epoch+1}")
    model.train()
    total_train_loss = 0
    total_train_correct = 0
    total_train_count = 0
    for batch in tqdm(train_dataloader):
        model.zero_grad()
        optimizer.zero_grad()

        input_ids = batch['input_ids'].squeeze(1).to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs.logits, labels)
        loss.backward()

        optimizer.step()

        total_train_loss += loss.item()
        preds = torch.argmax(outputs.logits, dim=1)
        total_train_correct += accuracy_metric(preds, labels).item() * len(labels)
        total_train_count += len(labels)

    # Validation
    model.eval()
    total_val_loss = 0
    total_val_correct = 0
    total_val_count = 0
    with torch.no_grad():
        for batch in val_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs.logits, labels)
            total_val_loss += loss.item()
            preds = torch.argmax(outputs.logits, dim=1)
            total_val_correct += accuracy_metric(preds, labels).item() * len(labels)
            total_val_count += len(labels)
    print(f"Epoch {epoch+1}/{epochs} - Train Loss: {total_train_loss/len(train_dataloader):.4f}, Train Accuracy: {total_train_correct/total_train_count:.4f}, Validation Loss: {total_val_loss/len(val_dataloader):.4f}, Validation Accuracy: {total_val_correct/total_val_count:.4f}")

    # Saving model and tokenizer after each epoch
    save_path = f"{save_directory}/model_epoch_{epoch+1}"
    model.save_pretrained(save_path)
    tokenizer.save_pretrained(save_path)
    print(f"Model and tokenizer saved at: {save_path}")
model.save_pretrained(save_directory + "/final_model")

Thanks for providing more code. Interestingly, it seems that I get somewhat different results:

...
lora_model = get_peft_model(model, config)
print(lora_model.base_model.model.score.original_module.weight.requires_grad)
# prints False

while you mentioned you got True. I also wanted to run the rest of your code, but I don't know what data you're using, so that was not possible.

Anyway, I think the best course of action for you would be to try a newer PEFT version.

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Use "requires_grad_" instead of "requires_grad"