Closed drei34 closed 10 months ago
You are correct in your statement that if the classification head is called "score", it should be added as such to modules_to_save. If you print the PEFT model and everything works as expected, you should see that the classification head was replaced by PEFT with a ModulesToSaveWrapper.
Now regarding the question why you see NaNs, this is hard to answer. It could be that the training parameters are not well chosen. Just as an example, if the learning rate is too high, we could see NaNs in the output. When you exclude "score" from modules_to_save, we don't have any fully fine-tuned layers, so the same learning rate may not produce NaNs. Did you check all the usual settings that can help with stabilizing training?
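Two of the usual stabilizing settings, gradient clipping and refusing to step on a non-finite loss, can be sketched as follows. This is a minimal sketch assuming a plain PyTorch loop; the helper name `safe_step` and the value `max_norm=1.0` are illustrative assumptions, not something taken from this thread.

```python
import torch
from torch import nn


def safe_step(model, optimizer, loss, max_norm=1.0):
    """Backprop, clip gradients, and step; refuse to step on a non-finite loss.

    Hypothetical helper for illustration. Returns True if a step was taken.
    """
    if not torch.isfinite(loss):
        # Skip the update so NaN/inf never reaches the weights.
        optimizer.zero_grad()
        return False
    loss.backward()
    # Rescale gradients so their global L2 norm is at most max_norm.
    nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    optimizer.zero_grad()
    return True


# Tiny usage example on random data.
torch.manual_seed(0)
toy = nn.Linear(4, 3)
opt = torch.optim.SGD(toy.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randint(0, 3, (8,))
stepped = safe_step(toy, opt, nn.functional.cross_entropy(toy(x), y))
skipped = safe_step(toy, opt, torch.tensor(float("nan")))
```

A skipped step does not fix the underlying cause, but it makes the first bad batch easy to spot instead of silently poisoning every later update.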
Thank you for the comment, really appreciate it. I tried LR = 0.0000000001 etc. and this still does not work; I still get NaN. I then went into the gears manually and have some answers/questions for you below:
Should I use lora_model = get_peft_model(model, config) or model? Both seem to change. For example, here is what the model looks like for GPT2 before and after.
Original model:
model after; notice 'score' is wrapped:
lora_model; this is a PEFT model and 'score' is wrapped:
The config and the call we use to get the model; the target layers are the Conv1D or Linear layers inside the model:
config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["c_attn", "c_proj", "c_fc"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["score"],
)
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()
I could also put 'score' in target_modules, right? But then it is not really fine-tuned.
I tried doing the update manually with p = p - LR*p.grad, to control the LR per layer even. When I do this, for the 'score' layer, I see None for the gradient. I get the printout below. Here I am printing "name p.shape p.requires_grad p.grad". You see that score.modules_to_save.default.weight has None as its grad. If I use optimizer.step() directly I get NaN loss. Should I skip this layer in my manual update? What is the difference between original_module and modules_to_save?
score.original_module.weight torch.Size([3, 768]) True tensor([[-0.2822, -0.3643, 0.5391, ..., 0.5933, -0.6211, 0.1171],
[-0.1940, 0.0462, 0.0337, ..., -0.0362, -0.1427, -0.0820],
[ 0.4763, 0.3179, -0.5723, ..., -0.5571, 0.7637, -0.0352]],
device='cuda:0', dtype=torch.float16)
score.modules_to_save.default.weight torch.Size([3, 768]) True None
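The manual update described above can be sketched like this. It is a minimal illustration on a toy module rather than the model from this thread; `manual_sgd_step` and `lr_for` are hypothetical names. Parameters whose `.grad` is `None` (for example because they never entered the forward pass) are skipped instead of crashing the update.

```python
import torch
from torch import nn


def manual_sgd_step(model, lr_for):
    """Apply p <- p - lr * p.grad per parameter; lr_for(name) returns the LR.

    Hypothetical helper. Returns the names of parameters that were skipped.
    """
    skipped = []
    with torch.no_grad():  # the update itself must not be tracked by autograd
        for name, p in model.named_parameters():
            if p.grad is None:
                skipped.append(name)
                continue
            p -= lr_for(name) * p.grad
            p.grad = None  # reset for the next step
    return skipped


torch.manual_seed(0)
toy = nn.ModuleDict({"used": nn.Linear(4, 3), "unused": nn.Linear(4, 3)})
before = toy["used"].weight.clone()
loss = toy["used"](torch.randn(2, 4)).sum()  # "unused" never enters the graph
loss.backward()
skipped = manual_sgd_step(toy, lambda name: 1e-2)
```

Skipping is only a workaround: a trainable parameter that reports `None` for its gradient usually means the forward pass did not go through it, which is worth investigating on its own.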
Thank you for investigating further. First of all, you should use the return value from get_peft_model. It is true that the model that you pass is getting modified as well, but the PeftModel you get back is the one to use (it has many helpful methods added on top).
The model reprs that you show all look quite correct, so I don't think there is an issue with your LoRA config. In theory, it would be possible to tune the score layer by adding it as a LoRA target instead of to modules_to_save, but this is usually a bad idea: this layer is initialized totally randomly, so changing it "a little bit", as would be the case with LoRA, is unlikely to lead to success.
Regarding the score layer, it is necessary to understand the implementation a little bit. When you add a layer to modules_to_save, we basically create a copy of it and use the copy instead of the original weights. The copy gets updated, the original weights stay the same. This allows users to later switch back and forth between original model and fine-tuned model.
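The copy-and-switch idea can be illustrated with a small stand-in. This is a simplified sketch of the concept only, not PEFT's actual implementation; the class `ToySaveWrapper` and its attributes `trainable_copy` and `use_copy` are hypothetical names.

```python
import copy

import torch
from torch import nn


class ToySaveWrapper(nn.Module):
    """Simplified stand-in for the idea behind modules_to_save (not PEFT's code)."""

    def __init__(self, module):
        super().__init__()
        self.original_module = module
        self.trainable_copy = copy.deepcopy(module)
        self.original_module.requires_grad_(False)  # frozen reference weights
        self.trainable_copy.requires_grad_(True)    # this copy gets trained
        self.use_copy = True

    def forward(self, x):
        active = self.trainable_copy if self.use_copy else self.original_module
        return active(x)


torch.manual_seed(0)
head = ToySaveWrapper(nn.Linear(768, 3, bias=False))
head(torch.randn(2, 768)).sum().backward()

grad_on_original = head.original_module.weight.grad  # None: frozen and unused
grad_on_copy = head.trainable_copy.weight.grad       # tensor of shape (3, 768)
```

In this toy setup the gradient lands on the copy and the original's `.grad` stays `None`; flipping `use_copy` to `False` routes the forward pass back through the untouched original weights.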
What's strange in your final output is that apparently, the original weights have requires_grad = True and a gradient (if I read it correctly). Normally, this shouldn't happen, they should not be updated. Could you please check that everything is set correctly right after you created the PEFT model? It should be something like this:
lora_model = get_peft_model(model, config)
lora_model.base_model.model.score.original_module.requires_grad # should be False
lora_model.base_model.model.score.modules_to_save["default"].weight.requires_grad # should be True
lora_model.base_model.model.score.active_adapter # should be 'default'
torch.allclose(
    lora_model.base_model.model.score.modules_to_save["default"].weight,
    lora_model.base_model.model.score.original_module.weight,
)  # should be True
Hey! Thank you again. Actually, your 3 lines above crash, but I'm sure this is working. When I add "score" to modules_to_save and call print_trainable_parameters, I get more parameters than if I use a random name like "classifier", so the head is being used as I want.
One question I have: do you HAVE to use the PEFT model? I.e. would the original give wrong answers?
I'll explain step by step with pictures the other comments you made.
lora_model.base_model.model.score.original_module.requires_grad # should be False
lora_model.base_model.model.score.modules_to_save["default"].weight.requires_grad # should be True
lora_model.base_model.model.score.active_adapter # should be 'default'
The first command above crashes with an error as below. This object does have requires_grad_, though, but I'm not sure that's what you want as it's a method. I think you mean the "weight" matrix, though, and this is True (!). The other two items are True and 'default' as you say.
Possibly related to this, I was trying to fine-tune LLAMA2 and GPT2 to see the lift when we go 120 M -> 7B parameters (and use LoRA for LLAMA2 since we can't fine-tune it directly). When I was doing just fine-tuning with GPT2 I was getting NaN loss, but before (in some older work) I was not. I looked at the notebook carefully and I found that I was specifying torch_dtype=torch.float16. When I removed this, I no longer get NaN loss in my runs. I have not tried LoRA + LLAMA2 yet with this change; it might be related, but I'll circle back. I also pasted the versions of things I am running. Might this be a problem?
Thanks for checking. Indeed, my code was missing the .weight attribute, as you correctly guessed. What is strange to me is that the original module has requires_grad = True. Is that straight after you created the model using get_peft_model? Could you please disable gradients on that module?
lora_model.base_model.model.score.original_module.requires_grad_(False)
I looked at the notebook carefully and I found that I was specifying torch_dtype=torch.float16. When I removed this, I no longer get NaN loss in my runs.
Yes, fp16 can more easily result in numerical instabilities. What are you using for training the model? Transformers Trainer or some other framework or custom code? Could you share it?
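Two generic ways float16 can produce NaN are shown below. This is only an illustration of the number format; whether either effect is what happened in the run discussed here is an assumption that the thread does not confirm.

```python
import torch

# float16 tops out at 65504, so larger values overflow to inf.
big = torch.tensor(70000.0, dtype=torch.float16)
overflowed = torch.isinf(big).item()

# inf - inf is undefined, which is one way NaN enters a loss.
is_nan = torch.isnan(big - big).item()

# Adam's default eps (1e-8) is below float16's smallest positive value
# (about 6e-8), so it rounds to zero and no longer guards the division.
eps_is_zero = (torch.tensor(1e-8, dtype=torch.float16) == 0).item()
```

Loading the weights in float32 (the default when no dtype is passed) avoids both effects, at the cost of more memory.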
I also pasted the version of things I am running. Might this be a problem?
Newer versions are generally better because there might have been bug fixes in-between. Could you please try the latest PEFT version?
do you HAVE to use the PEFT model? I.e. would the original give wrong answers?
You could use the original model (which is modified by PEFT) and it should work; you will just miss out on certain features, like saving the adapter weights or merging them into the original weights.
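What "merging into the original weights" means arithmetically can be sketched with plain tensors. This assumes the standard LoRA formulation with scaling lora_alpha / r; the sizes and variable names are made up for illustration, and this is not PEFT's merge code.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, lora_alpha = 8, 6, 2, 4  # toy sizes, chosen for illustration
scaling = lora_alpha / r

W = torch.randn(d_out, d_in)  # frozen base weight
A = torch.randn(r, d_in)      # LoRA "down" matrix
B = torch.randn(d_out, r)     # LoRA "up" matrix (random here so the update is nonzero)
x = torch.randn(3, d_in)

# Unmerged: base path plus the scaled low-rank path.
y_adapter = x @ W.T + scaling * (x @ A.T @ B.T)

# Merged: fold the low-rank update into a single weight matrix.
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T
```

Both paths give the same output up to floating-point error, which is why a merged model needs no extra adapter computation at inference time.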
Hey yes that was straight after. I.e. this is the code I am running.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'gpt2'
num_labels = 3
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    device_map='cuda'
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=False,
    trust_remote_code=True,
    device_map='cuda'
)
# GPT2 has no pad token, so reuse EOS and pad on the left
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model.resize_token_embeddings(len(tokenizer)) # https://github.com/huggingface/transformers/issues/1805
if model_name == 'gpt2':
    model.config.pad_token_id = model.config.eos_token_id
config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["c_attn", "c_proj", "c_fc"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["score"],
)
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()
# Freeze the original head; requires_grad_ is the module-level method
lora_model.base_model.model.score.original_module.requires_grad_(False)
My custom trainer code is below; I think it's pretty standard ... I wrote this to debug all the above (lol) but now maybe we can just use Trainer. But Trainer is very hard for me to debug, it abstracts so much away.
import torch
from torch import nn
from torchmetrics import Accuracy
from tqdm import tqdm

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
if use_cuda:
    model = model.cuda()
loss_fn = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
accuracy_metric = Accuracy(task="multiclass", num_classes=3).to(device)
epochs = 10
save_directory = "./gpt2_checkpoints" # Define the directory where you want to save your models
for epoch in range(epochs):
    # Training
    print(f"Epoch: {epoch+1}")
    model.train()
    total_train_loss = 0
    total_train_correct = 0
    total_train_count = 0
    for batch in tqdm(train_dataloader):
        model.zero_grad()
        optimizer.zero_grad()
        input_ids = batch['input_ids'].squeeze(1).to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs.logits, labels)
        loss.backward()
        optimizer.step()
        total_train_loss += loss.item()
        preds = torch.argmax(outputs.logits, dim=1)
        total_train_correct += accuracy_metric(preds, labels).item() * len(labels)
        total_train_count += len(labels)
    # Validation
    model.eval()
    total_val_loss = 0
    total_val_correct = 0
    total_val_count = 0
    with torch.no_grad():
        for batch in val_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs.logits, labels)
            total_val_loss += loss.item()
            preds = torch.argmax(outputs.logits, dim=1)
            total_val_correct += accuracy_metric(preds, labels).item() * len(labels)
            total_val_count += len(labels)
    print(f"Epoch {epoch+1}/{epochs} - Train Loss: {total_train_loss/len(train_dataloader):.4f}, Train Accuracy: {total_train_correct/total_train_count:.4f}, Validation Loss: {total_val_loss/len(val_dataloader):.4f}, Validation Accuracy: {total_val_correct/total_val_count:.4f}")
    # Saving model and tokenizer after each epoch
    save_path = f"{save_directory}/model_epoch_{epoch+1}"
    model.save_pretrained(save_path)
    tokenizer.save_pretrained(save_path)
    print(f"Model and tokenizer saved at: {save_path}")
model.save_pretrained(save_directory + "/final_model")
Thanks for providing more code. Interestingly, it seems that I get somewhat different results:
...
lora_model = get_peft_model(model, config)
print(lora_model.base_model.model.score.original_module.weight.requires_grad)
# prints False
while you mentioned you got True. I also wanted to run the rest of your code, but I don't know what data you're using, so that was not possible.
Anyway, I think the best course of action for you would be to try a newer PEFT version.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Use "requires_grad_" instead of "requires_grad"
System Info
Hi, I am using LLAMA2 and GPT2 for sequence classification. Both models add a "score" layer on top to transform the last embedding of the tokens into a vector of class logits. If I specify the layers as below for GPT2 or LLAMA2, I see NaN for the train and validation accuracy. If I use "classifier", which is not the name of a layer, everything "works" in the sense that I get a loss and accuracy improves, but as I understand it the classifier head is then just random numbers, so all other parameters are changing to try and circumvent this layer.
Also, we can put "score" in "target_modules" and this works, but this should not be what we do if I understand right. This layer has no information of value in it, so it should be properly fine-tuned.
Any ideas on what is wrong?
Reproduction
Expected behavior