huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Using LoRA consumes high memory #1469

Closed WenxiongLiao closed 8 months ago

WenxiongLiao commented 9 months ago

System Info

transformers==4.34.0 torch==1.13.1 peft==0.5.0 accelerate==0.23.0

Who can help?

@pacman100 @younesbelkada @sayakpaul @stevhliu @MKhalusova

Information

Tasks

Reproduction

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from datasets import load_dataset
from transformers import AutoTokenizer
from peft import LoraConfig, TaskType
from peft import get_peft_model
from transformers import BertForSequenceClassification
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer

#Trainable Params
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params, all_param = 0, 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad: trainable_params += param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_param} || trainable %: {100 * trainable_params / all_param}")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokenized_datasets = dataset.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=16, lora_alpha=1, lora_dropout=0.1
)

model = BertForSequenceClassification.from_pretrained(
    'bert-base-cased', 
    num_labels=2
)

model = get_peft_model(model, lora_config)
print_trainable_parameters(model)

training_args = TrainingArguments(
    output_dir="test_trainer",
    per_device_train_batch_size=64,
    num_train_epochs=25,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset
)

trainer.train()

Expected behavior

I expected LoRA to consume very little memory, but it uses around 40GB. Is there a problem with my code or with one of the package versions?

BenjaminBossan commented 9 months ago

I could reproduce the issue with more current versions of the packages. Indeed using LoRA doesn't save a lot of memory in this case. I think this is because of the model architecture, which is relatively small. When I tried a different one (bloomz-560m) with the rest of the code being identical, I saw quite a big difference in memory usage (4GB vs OOM).

WenxiongLiao commented 9 months ago

> I could reproduce the issue with more current versions of the packages. Indeed using LoRA doesn't save a lot of memory in this case. I think this is because of the model architecture, which is relatively small. When I tried a different one (bloomz-560m) with the rest of the code being identical, I saw quite a big difference in memory usage (4GB vs OOM).

Can you share your full code for bloomz-560m with me? I'll test it too

BenjaminBossan commented 9 months ago

Sure, here is the script that I used:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from datasets import load_dataset
from transformers import AutoTokenizer
from peft import LoraConfig, TaskType
from peft import get_peft_model
from transformers import BertForSequenceClassification, AutoModelForSequenceClassification
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

small_train_dataset = dataset["train"].shuffle(seed=42).select(range(1000)).map(tokenize_function, batched=True)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=16, lora_alpha=1, lora_dropout=0.1
)

model = AutoModelForSequenceClassification.from_pretrained("bigscience/bloomz-560m", num_labels=2)
# comment out the next two lines to try full fine-tuning
model = get_peft_model(model, lora_config).cuda()
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="test_trainer",
    per_device_train_batch_size=1,
    num_train_epochs=25,
    fp16=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset
)
trainer.train()

As you can see, I made some minor modifications besides switching to bloomz, but none of them change the overall findings.

WenxiongLiao commented 9 months ago

> Sure, here is the script that I used: […]

When I execute this code in my environment, the following happens: LoRA takes up about 4GB of memory when training just starts, but it stabilizes at about 20GB in the later stages of training, which is about the same as full fine-tuning. I think there must be a problem somewhere.

BenjaminBossan commented 9 months ago

Could you please provide more details? How long did you train, when did you see this memory increase?

I ran this exact script for > 12 min on my machine (> 5000 steps) and I see no memory increase whatsoever. Peak memory never goes above what was initially allocated, it's pretty much flat the whole time.
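
A simple way to double-check this on your side is to read torch's built-in peak-memory counters somewhere inside the training loop (just a sketch, values in GiB):

import torch

# print and then reset the peak GPU memory counters, e.g. every N steps
peak_alloc = torch.cuda.max_memory_allocated() / 1024**3
peak_reserved = torch.cuda.max_memory_reserved() / 1024**3
print(f"peak allocated: {peak_alloc:.2f} GiB | peak reserved: {peak_reserved:.2f} GiB")
torch.cuda.reset_peak_memory_stats()  # so the next interval reports its own peak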

WenxiongLiao commented 9 months ago

> Could you please provide more details? How long did you train, when did you see this memory increase?
>
> I ran this exact script for > 12 min on my machine (> 5000 steps) and I see no memory increase whatsoever. Peak memory never goes above what was initially allocated, it's pretty much flat the whole time.

I ran this script for 1000 steps on my machine. For full fine-tuning, the memory consumption increased from 14GB to 17GB and then stabilized. For LoRA, the consumed memory increased from 7GB to 20GB and then stabilized. Why does LoRA consume more memory than full fine-tuning? I always felt something was wrong.

Can you tell me the reason? In addition, please tell me the transformers, torch and peft versions on your machine. I will try the same environment as yours to see if this is still the case.

BenjaminBossan commented 9 months ago

> Why does LoRA consume more memory than full fine-tuning? I always felt something was wrong.

This should definitely not happen; there must be something going wrong.

The versions I use are:

  • accelerate==0.27.2
  • torch==2.2.0

transformers and peft are built from source. Could you please try to install the respective latest versions and try again?

WenxiongLiao commented 9 months ago

> Could you please try to install the respective latest versions and try again?

Unfortunately, even after matching your environment, I am still in the same situation as before. In addition, my program always runs on GPU 0, even though I set os.environ["CUDA_VISIBLE_DEVICES"] = "2"

BenjaminBossan commented 9 months ago

Based on what you report, I really wonder if something is wrong in your environment, as I haven't observed this behavior. One thing you could test to help debugging is whether you see the same issue with a pure PyTorch training loop, i.e. without Trainer and accelerate.
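
For illustration, such a loop could look roughly like this (a minimal sketch reusing model and small_train_dataset from the script above; batch size and logging interval are placeholders):

import torch
from torch.utils.data import DataLoader

# minimal pure-PyTorch loop, no Trainer and no accelerate
device = torch.device("cuda")
model.to(device)

def collate(batch):
    return {
        "input_ids": torch.tensor([b["input_ids"] for b in batch]),
        "attention_mask": torch.tensor([b["attention_mask"] for b in batch]),
        "labels": torch.tensor([b["label"] for b in batch]),
    }

loader = DataLoader(small_train_dataset, batch_size=1, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=5e-5)

model.train()
for step, batch in enumerate(loader):
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss  # the model computes the loss when labels are passed
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step % 100 == 0:
        print(step, loss.item(), torch.cuda.max_memory_allocated() / 1024**3)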

ddofer commented 9 months ago

I have the same issue, also with a sequence classification task (with LoRA, QLoRA), ESM2 models, 1 GPU.

BenjaminBossan commented 9 months ago

> I have the same issue, also with a sequence classification task (with LoRA, QLoRA), ESM2 models, 1 GPU.

This is probably a separate issue (#1023).

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

nrafaili commented 1 month ago

> I have the same issue, also with a sequence classification task (with LoRA, QLoRA), ESM2 models, 1 GPU.

I also have the same issue with the ESM2 models and 1 GPU... Any news on this?

I should note that the issue persists even when I manually implement LoRA without the peft package. Maybe it's an issue within ESM2?

BenjaminBossan commented 1 month ago

@nrafaili Did you check the thread in #1023? An issue appears to be that for ESM2, torch reserves much more memory than it actually allocates. Did you observe the same behavior with your custom LoRA implementation?
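
For context, the two numbers can be read directly from torch's counters; reserved memory is what the caching allocator holds on to and can be much larger than what tensors actually occupy (a minimal sketch):

import torch

# allocated = memory currently occupied by tensors
# reserved  = memory held by torch's caching allocator (can be much larger)
alloc = torch.cuda.memory_allocated() / 1024**3
reserved = torch.cuda.memory_reserved() / 1024**3
print(f"allocated: {alloc:.2f} GiB | reserved: {reserved:.2f} GiB")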

nrafaili commented 1 month ago

> @nrafaili Did you check the thread in #1023? An issue appears to be that for ESM2, torch reserves much more memory than it actually allocates. Did you observe the same behavior with your custom LoRA implementation?

Thank you for your response, @BenjaminBossan. I kept your code and LoRA hyperparameters the same, but I added additional logic to freeze the model, as initially 98% of the model's params were trainable (with my custom LoRA implementation).

$python testing.py full_finetuning
trainable params:  7,739,006 || all params:  7,841,726 || trainable%: 98.69008430031857
0 1.650145 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
1 0.869868 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
2 0.775971 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
3 0.740067 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
4 0.726650 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
5 0.725119 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
6 0.834866 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
7 1.816483 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
8 1.182235 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
9 0.992770 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
10 0.788659 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
11 0.792953 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
12 0.740465 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
13 0.716211 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
14 0.734048 {'allocated': 0.04642200469970703, 'reserved': 5.87109375}
15 0.791791 {'allocated': 0.04631471633911133, 'reserved': 5.87109375}

$python testing.py last_layer_finetuning
trainable params:  1,605 || all params:  7,841,726 || trainable%: 0.02046743280752222
0 1.650145 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
1 0.906589 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
2 0.831657 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
3 0.790660 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
4 0.777377 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
5 0.768765 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
6 0.779916 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
7 0.895397 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
8 0.794046 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
9 0.725681 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
10 0.723187 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
11 0.729553 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
12 0.714959 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
13 0.746474 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
14 0.781195 {'allocated': 0.04550647735595703, 'reserved': 0.7265625}
15 0.815063 {'allocated': 0.04539918899536133, 'reserved': 0.7265625}

$python testing.py custom_lora_finetuning
trainable params:  93,765 || all params:  7,933,886 || trainable%: 1.1818294338991007
0 1.556088 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
1 0.911745 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
2 0.818445 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
3 0.785220 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
4 0.771485 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
5 0.770516 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
6 0.764579 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
7 0.751391 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
8 0.720615 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
9 0.744575 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
10 0.739499 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
11 0.758067 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
12 0.837883 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
13 0.796601 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
14 0.743176 {'allocated': 0.04676532745361328, 'reserved': 5.13671875}
15 0.735887 {'allocated': 0.04665803909301758, 'reserved': 5.13671875}

It seems like it is allocating slightly less memory, and reserving a lot less (but still a lot).

I was not able to perform the quantized testing; I kept getting this error and was unsure how to resolve it: RuntimeError: shape '[51200, 1]' is invalid for input of size 102400

But I am more concerned with getting the memory issue figured out with the non-quantized version first; then I can troubleshoot that. I also performed the experiment using BERT (bert-base-uncased), and it shows a similar trend.

$python testing.py full_finetuning
trainable params:  108,895,493 || all params:  109,486,085 || trainable%: 99.46057802687893
0 1.627409 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
1 0.798242 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
2 4.995363 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
3 5.551763 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
4 4.034765 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
5 2.689088 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
6 4.740309 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
7 11.709240 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
8 5.442474 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
9 10.454267 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
10 3.879201 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
11 7.894711 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
12 2.659765 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
13 4.921863 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
14 2.954051 {'allocated': 0.42508935928344727, 'reserved': 11.658203125}
15 4.949404 {'allocated': 0.42498207092285156, 'reserved': 11.658203125}

$python testing.py last_layer_finetuning
trainable params:  3,845 || all params:  109,486,085 || trainable%: 0.003511861804173562
0 1.627409 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
1 1.678361 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
2 11.969365 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
3 2.394891 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
4 13.173813 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
5 2.617830 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
6 13.228453 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
7 0.837636 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
8 7.334270 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
9 9.572532 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
10 7.188825 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
11 9.910066 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
12 5.611490 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
13 7.714812 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
14 6.436080 {'allocated': 0.42508935928344727, 'reserved': 1.408203125}
15 9.650103 {'allocated': 0.42498207092285156, 'reserved': 1.408203125}

$python testing.py custom_lora_finetuning
trainable params:  446,213 || all params:  109,928,453 || trainable%: 0.4059121981822122
0 1.850149 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
1 0.900258 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
2 7.323673 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
3 4.002155 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
4 1.588919 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
5 3.291740 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
6 0.986782 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
7 4.414488 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
8 7.797668 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
9 1.500626 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
10 5.290092 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
11 3.174714 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
12 8.158594 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
13 1.614832 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
14 5.834119 {'allocated': 0.42673730850219727, 'reserved': 8.544921875}
15 3.144488 {'allocated': 0.42663002014160156, 'reserved': 8.544921875}

Do you have any suggestions for further troubleshooting? Are we missing something?

Thank you in advance and thank you for your time!

BenjaminBossan commented 1 month ago

Thanks for reporting these additional details. I think the reported allocated memory looks suspicious; how did you measure it? It doesn't appear to change at all, even with full fine-tuning.

Perhaps you could try this function to measure memory and see if that changes anything. The values may also depend on where in the training loop you call the function.

I ran some additional tests to check if things have changed with more recent versions. For that, I basically took this notebook and used the aforementioned function for measuring memory (putting it after optimizer.step() and before optimizer.zero_grad()).
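
I.e. the measurement goes roughly here in the loop (just a placement sketch; get_memory_stats refers to the helper pasted further down in this thread, and the rest of the loop follows the notebook):

import torch

# only the placement matters here; model, dataloader and optimizer come from the notebook
for step, batch in enumerate(dataloader):
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    stats = get_memory_stats(torch.device("cuda"))  # measure here: after optimizer.step() ...
    optimizer.zero_grad()                           # ... and before optimizer.zero_grad()
    print(step, stats)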

What I found is that reserved memory is larger than allocated memory (which is expected), but the bigger the model, the smaller the gap.

nrafaili commented 1 month ago

I measured the memory using your code from the other issue, slightly modified. I've now changed the memory function to the new one, as it seems more accurate.

import warnings
import transformers
import torch
from transformers import  EsmModel, BertModel, BertConfig, EsmConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from torch import nn
from accelerate import Accelerator
import torch.nn.functional as F
import re
def get_memory_stats(device: torch.device, reset_stats: bool = True) -> dict:
    """
    Computes a memory summary for the passed in device. If reset_stats is True, this will
    also reset CUDA's peak memory tracking. This is useful to get data around relative use of peak
    memory (e.g. peak memory during model init, during forward, etc) and optimize memory for
    individual sections of training.
    Args:
        device (torch.device): Device to get memory summary for. Only CUDA devices are supported.
        reset_stats (bool): Whether to reset CUDA's peak memory tracking.
    Returns:
        Dict[str, float]: A dictionary containing the peak memory active, peak memory allocated,
        and peak memory reserved. This dict is useful for logging memory stats.
    Raises:
        ValueError: If the passed-in device is not CUDA.
    """
    if device.type != "cuda":
        raise ValueError(
            f"Logging memory stats is only supported on CUDA devices, got {device}"
        )
    peak_memory_active = torch.cuda.memory_stats().get("active_bytes.all.peak", 0) / (
        1024**3
    )
    peak_mem_alloc = torch.cuda.max_memory_allocated(device) / (1024**3)
    peak_mem_reserved = torch.cuda.max_memory_reserved(device) / (1024**3)
    if reset_stats:
        torch.cuda.reset_peak_memory_stats(device)
    memory_stats = {
        "peak_memory_active": peak_memory_active,
        "peak_memory_alloc": peak_mem_alloc,
        "peak_memory_reserved": peak_mem_reserved,
    }
    return memory_stats

def verify_data_types(model):
    # Verifying the datatypes.
    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes:
            dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items():
        total += v
    for k, v in dtypes.items():
        print(f"{k}, {v}, {v / total}")

def get_nb_trainable_parameters(model):
    r"""
    Returns the number of trainable parameters and number of all parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        # Due to the design of 4bit linear layers from bitsandbytes
        # one needs to multiply the number of parameters by 2 to get
        # the correct number of parameters
        if param.__class__.__name__ == "Params4bit":
            num_params = num_params * 2

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params

    return trainable_params, all_param

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params, all_param = get_nb_trainable_parameters(model)
    print(
        f"trainable params: {trainable_params: ,} || all params: {all_param: ,} || trainable%: {100 * trainable_params / all_param}"
    )

class LoRAConfig:
    def __init__(self):
        self.lora_rank = 8
        self.lora_alpha = 32 
        self.lora_init_scale = 0.01
        self.trainable_param_names = "lora_A|lora_B"
        self.lora_modules = [
            "attention.self.query",
            "attention.self.key",
            "attention.self.value"
        ]
        self.lora_layers = "query|key|value"

class LoRALinear(nn.Module):
    def __init__(self, linear_layer, rank, alpha, init_scale):
        super().__init__()
        self.in_features = linear_layer.in_features
        self.out_features = linear_layer.out_features
        self.rank = rank
        self.scaling = alpha / rank
        self.weight = linear_layer.weight
        self.bias = linear_layer.bias

        self.freeze_original_params()

        if isinstance(self.weight, torch.nn.Parameter):
            self.weight_shape = self.weight.shape
        else:  # For quantized weights
            self.weight_shape = self.weight.shape()

        if self.rank > 0:
            self.lora_A = nn.Parameter(torch.randn(self.in_features, rank) * init_scale)
            self.lora_B = nn.Parameter(torch.zeros(rank, self.out_features))

    def freeze_original_params(self):
        if isinstance(self.weight, torch.nn.Parameter):
            self.weight.requires_grad = False
        if self.bias is not None:
            self.bias.requires_grad = False

    def forward(self, input):
        if not isinstance(self.weight, torch.nn.Parameter):
            weight = self.weight().to(input.dtype)
        else:
            weight = self.weight

        if self.rank > 0:
            lora_weight = torch.matmul(self.lora_A, self.lora_B).view(self.weight_shape)
            weight = weight + lora_weight * self.scaling

        return F.linear(input, weight, self.bias)        

def modify_with_lora(model, config):
    for name, module in model.named_modules():
        if any(re.search(pattern, name) for pattern in config.lora_modules):
            if re.fullmatch(config.lora_layers, name.split('.')[-1]):
                if isinstance(module, nn.Linear) or (hasattr(module, 'weight') and callable(getattr(module, 'weight'))):
                    lora_layer = LoRALinear(
                        module,
                        config.lora_rank,
                        config.lora_alpha,
                        config.lora_init_scale
                    )
                    parent_name = name.rsplit('.', 1)[0]
                    parent_module = dict(model.named_modules())[parent_name]
                    setattr(parent_module, name.split('.')[-1], lora_layer)

    # Freeze all parameters except LoRA parameters
    for name, param in model.named_parameters():
        if not any(x in name for x in ['lora_A', 'lora_B']):
            param.requires_grad_(False)
        else:
            param.requires_grad_(True)

    return model

def esm_config():
    return {
        "attention_probs_dropout_prob": 0.0,
        "hidden_act": "gelu",
        "hidden_dropout_prob": 0.0,
        "hidden_size": 320,
        "initializer_range": 0.02,
        "intermediate_size": 1280,
        "max_position_embeddings": 1026,
        "num_attention_heads": 20,
        "num_hidden_layers": 6,
        "vocab_size": 33
    }

class Encoder(nn.Module):
    def __init__(self, base_model, use_peft=False, use_custom_lora=False, last_layer_only=False):
        super().__init__()

        self.model = base_model
        if use_peft:
            config = LoraConfig(
                r=8, lora_alpha=32, target_modules=["query", "key", "value"],
                lora_dropout=0.05, bias="none"
            )
            self.model = get_peft_model(self.model, config)
        elif use_custom_lora:
            lora_config = LoRAConfig()
            self.model = modify_with_lora(self.model, lora_config)

        self.pooling_layer = nn.AdaptiveAvgPool1d(output_size=1)
        self.head = nn.Linear(self.model.embeddings.position_embeddings.embedding_dim, 5)

        if last_layer_only:
            for param in self.model.parameters():
                param.requires_grad = False
            for param in self.head.parameters():
                param.requires_grad = True

    def forward(self, x):
        outputs = self.model(input_ids=x['input_ids'], attention_mask=x['attention_mask'])
        last_hidden_state = outputs.last_hidden_state if hasattr(outputs, 'last_hidden_state') else outputs[0]
        transposed_feature = last_hidden_state.transpose(1, 2)
        pooled_features = self.pooling_layer(transposed_feature).squeeze(2)
        classification_head = self.head(pooled_features)
        return classification_head

def train(net, accelerator, dataloader, optimizer, loss_function, num_epochs=1):
    net, optimizer, dataloader = accelerator.prepare(net, optimizer, dataloader)
    device = accelerator.device

    for epoch in range(num_epochs):
        for i, (inputs, labels) in enumerate(dataloader):
            with accelerator.accumulate(net):
                outputs = net(inputs)
                loss = loss_function(outputs, labels.squeeze())
                accelerator.backward(loss)
                optimizer.step()

            if i % 3 == 0:  # Report every 3 steps
                memory_stats = get_memory_stats(device, reset_stats=True)
                print(f"Epoch {epoch}, Step {i}:")
                print(f"  Loss: {loss.item():.4f}")
                print("  Memory stats:")
                for key, value in memory_stats.items():
                    print(f"    {key}: {value:.9f} GB")
                print("-" * 50)
                optimizer.zero_grad()

class MyDataset(torch.utils.data.Dataset):
    def __init__(self):
        pass

    def __len__(self):
        return 2000

    def __getitem__(self, i):
        seq = torch.randint(0, 20, size=(SEQ_LEN,))
        labels = torch.randint(0, 2,  size=(1,))
        return {"input_ids": seq, "attention_mask": torch.ones_like(seq)}, labels

torch.manual_seed(0)
SEQ_LEN = 150
warnings.filterwarnings("ignore", message="A parameter name that contains `beta` will be renamed internally to `bias`.*")
warnings.filterwarnings("ignore", message="A parameter name that contains `gamma` will be renamed internally to `weight`.*")
transformers.logging.set_verbosity_error()

def run_experiment(model_name, finetuning_method):
    print(f"Running experiment: {model_name} - {finetuning_method}")

    if model_name == "bert-base-uncased":
        base_model = BertModel.from_pretrained("bert-base-uncased")
    elif model_name == "bert-esm-config":
        config = BertConfig.from_dict(esm_config())
        base_model = BertModel(config)
    elif model_name == "esm-8m":
        base_model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
    elif model_name == "esm-8m-absolute":
        config = EsmConfig.from_pretrained("facebook/esm2_t6_8M_UR50D")
        config.position_embedding_type = "absolute"
        base_model = EsmModel(config)

    net = Encoder(base_model, use_peft=(finetuning_method == "peft_lora"), 
                  use_custom_lora=(finetuning_method == "custom_lora"),
                  last_layer_only=(finetuning_method == "last_layer"))

    verify_data_types(net)
    print_trainable_parameters(net)

    dataset = MyDataset()
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=128)
    optimizer = torch.optim.AdamW(filter(lambda p: p.requires_grad, net.parameters()), lr=5e-5)
    criterion = nn.CrossEntropyLoss()

    accelerator = Accelerator()
    train(net, accelerator, dataloader, optimizer, criterion)
    del net,base_model
    torch.cuda.empty_cache()

if __name__ == "__main__":

    models = ["bert-base-uncased", "bert-esm-config", "esm-8m", "esm-8m-absolute"]
    finetuning_methods = ["full", "last_layer", "custom_lora", "peft_lora"]

    for model in models:
        for method in finetuning_methods:
            run_experiment(model, method)

Regarding your comment about the memory not changing, I looked back at the memory you reported in issue #1023, and it also seems largely unchanged. I notice a slight difference from step 0 to step 3, and then steps 3-15 remain the same or about the same.

I also ran the same code with the get_gpu_memory() function from issue 1023. The reported memory usage differs quite a lot.

New function (with the code reported above)

Old function (code from issue 1023)

ESM and BERT are largely identical except for the positional encoding. I modified the positional encoding of the ESM config from 'rotary' to 'absolute'. The allocated memory differed in the thousandth place, while the reserved memory remained the same. I also modified the BERT config to match that of ESM 8M. From the raw data reported above, I computed the reserved:allocated memory ratios for all 4 models.

Reserved:allocated ratios (new function):

| Model | full_finetuning | last_layer_finetuning | custom_LoRA | peft_LoRA |
| --- | --- | --- | --- | --- |
| ESM | 1.11x | 1.09x | 1.15x | 1.12x |
| ESM (modified) | 1.11x | 1.06x | 1.14x | 1.14x |
| BERT (modified) | 1.19x | 1.76x | 1.12x | 1.11x |
| BERT (unmodified) | 1.03x | 1.65x | 1.05x | 1.04x |

Reserved:allocated ratios (old function):

| Model | full_finetuning | last_layer_finetuning | custom_LoRA | peft_LoRA |
| --- | --- | --- | --- | --- |
| ESM | 43.49x | 16.35x | 103x | 108.9x |
| ESM (modified) | 42.42x | 15.73x | 102.7x | 111x |
| BERT (modified) | 22.6x | 11.4x | 41.57x | 50.5x |
| BERT (unmodified) | 7.67x | 4x | 20x | 24.22x |

It seems that the new function gives a more robust and accurate picture of memory use than the old one. Lastly, I also tested the base model memory usage with the two functions.

Old

Running memory test: esm-8m
Memory stats for esm-8m:
  allocated: 0.029232025 GB
  reserved: 0.050781250 GB
Running memory test: esm-8m-absolute
Memory stats for esm-8m-absolute:
  allocated: 0.029229164 GB
  reserved: 0.050781250 GB
Running memory test: bert-base-uncased
Memory stats for bert-base-uncased:
  allocated: 0.408915520 GB
  reserved: 0.460937500 GB
Running memory test: bert-esm-config
Memory stats for bert-esm-config:
  allocated: 0.029238701 GB
  reserved: 0.050781250 GB

New

Running memory test: esm-8m
Memory stats for esm-8m:
  peak_memory_active: 0.029238701 GB
  peak_memory_alloc: 0.029238701 GB
  peak_memory_reserved: 0.470703125 GB
Running memory test: esm-8m-absolute
Memory stats for esm-8m-absolute:
  peak_memory_active: 0.029232025 GB
  peak_memory_alloc: 0.029232025 GB
  peak_memory_reserved: 0.470703125 GB
Memory stats for bert-base-uncased:
  peak_memory_active: 0.408915520 GB
  peak_memory_alloc: 0.408915520 GB
  peak_memory_reserved: 0.460937500 GB
Running memory test: bert-esm-config
Memory stats for bert-esm-config:
  peak_memory_active: 0.408915520 GB
  peak_memory_alloc: 0.408915520 GB
  peak_memory_reserved: 0.470703125 GB

Sorry it took me so long to respond. What do you make of this? Am I doing something wrong anywhere, or in the way I am reporting my results? I am a new researcher and new coder, so any advice is appreciated. A notable finding is that the custom LoRA implementation uses less memory than peft_lora, and the difference grows as the model gets larger (the number of params and % trainable params are exactly the same between the two).

BenjaminBossan commented 1 month ago

Thanks for the detailed report. So I think we can take away that the new function is more accurate at measuring memory usage. Also, I still think we see the trend that LoRA generally helps, and the bigger the model, the more it helps (of course with many caveats, but that is the tendency).

> A notable finding is that the custom LoRA implementation uses less memory than peft_lora, and the difference grows as the model gets larger

This is hard for me to judge, as it would require a very fine-grained analysis to figure out why that could be. At a glance, I see these lines as a potential cause:

            lora_weight = torch.matmul(self.lora_A, self.lora_B).view(self.weight_shape)
            weight = weight + lora_weight * self.scaling

        return F.linear(input, weight, self.bias)

So what you're doing is to calculate the extra weight from LoRA, add it to the base weight, then perform the F.linear operation. In PEFT, we don't add the weights but instead calculate the activations from base and LoRA separately, then add them:

https://github.com/huggingface/peft/blob/0d5894271bf5848793caa4223eac949827ef6a0d/src/peft/tuners/lora/layer.py#L574-L586

Mathematically, these operations should be the same (ignoring floating point arithmetic), but memory-wise they can differ. There are reasons why we want to calculate the activations like this in PEFT, whereas your code does not have the same constraints. But if you're happy with your custom implementation, I'd say go ahead and use it.
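
Schematically, the two approaches compare roughly like this (a minimal sketch with a plain nn.Linear, ignoring dropout and dtype handling; lora_A of shape (in_features, r) and lora_B of shape (r, out_features) are assumed):

import torch
import torch.nn.functional as F

def forward_merged(x, base, lora_A, lora_B, scaling):
    # merged-weight variant: materialize the full delta weight, add it to the
    # base weight, then do a single linear
    delta_w = (lora_A @ lora_B).T * scaling  # shape (out_features, in_features)
    return F.linear(x, base.weight + delta_w, base.bias)

def forward_separate(x, base, lora_A, lora_B, scaling):
    # PEFT-style: base and LoRA activations are computed separately and added,
    # so no full-sized delta weight is materialized in the forward pass
    return base(x) + ((x @ lora_A) @ lora_B) * scaling

base = torch.nn.Linear(320, 320)
lora_A = torch.randn(320, 8) * 0.01
lora_B = torch.randn(8, 320) * 0.01
x = torch.randn(2, 320)
print(torch.allclose(forward_merged(x, base, lora_A, lora_B, scaling=4.0),
                     forward_separate(x, base, lora_A, lora_B, scaling=4.0),
                     atol=1e-5))  # same result, different memory profile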

Not sure if you need anything more from me on this issue; if so, let me know.

nrafaili commented 1 month ago

Thank you so much for the quick response.

So, in your opinion, the memory utilization is normal, correct? I guess the benefit of LoRA comes with bigger models; otherwise it might be worth just doing full fine-tuning. Is that statement correct?

BenjaminBossan commented 1 month ago

> So, in your opinion, the memory utilization is normal, correct?

At least I can't immediately see anything wrong with the memory consumption.

> I guess the benefit of LoRA comes with bigger models; otherwise it might be worth just doing full fine-tuning. Is that statement correct?

I think a bit more nuance is needed. There are multiple factors contributing to overall memory consumption:

  1. the frozen parameters from the base model weights (which is where quantization enters the picture)
  2. the learnable parameters coming from LoRA, which require memory for gradients and optimizer states on top (but since these are only a few parameters, this doesn't matter much overall)
  3. the memory for hidden states/activations (e.g. for transformers, this grows very quickly with sequence length)

When we do full fine-tuning, we need to add memory for the gradients and optimizer states of the base model weights (1). When the model is small, the memory required for this is small compared to the memory for the hidden states (3), so we don't see a big advantage. However, the larger the model, the bigger this memory requirement becomes relative to the hidden states, making the savings from PEFT more effective. Does that make sense?
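
To put some very rough numbers on this (assuming fp32 weights and AdamW, with purely illustrative parameter counts; activation memory is excluded and comes on top in both cases):

# back-of-the-envelope estimate of weight + gradient + optimizer-state memory in GiB
# for fp32 + AdamW: 4 bytes per weight, 4 per gradient, 8 per trainable parameter of optimizer state
def train_state_gib(n_total, n_trainable):
    return (4 * n_total + 4 * n_trainable + 8 * n_trainable) / 1024**3

# a ~110M parameter model (BERT-sized), full fine-tuning vs LoRA with ~0.5M trainable params
print(train_state_gib(110e6, 110e6))  # ~1.6 GiB
print(train_state_gib(110e6, 0.5e6))  # ~0.4 GiB -> small absolute saving next to activations

# a ~7B parameter model, same comparison
print(train_state_gib(7e9, 7e9))      # ~104 GiB
print(train_state_gib(7e9, 10e6))     # ~26 GiB -> here the saving dominates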

nrafaili commented 1 month ago

That makes perfect sense. Very insightful. I really appreciate your help with all this.

One last question if you don't mind:

Why do you think there is such a stark difference in the reported memory ratios (using the old function) for some of these models, particularly for the LoRA implementations? For example, the ratios for ESM and the modified BERT (old function):

| Model | full_finetuning | last_layer_finetuning | custom_LoRA | peft_LoRA |
| --- | --- | --- | --- | --- |
| ESM | 43.49x | 16.35x | 103x | 108.9x |
| ESM (modified) | 42.42x | 15.73x | 102.7x | 111x |
| BERT (modified) | 22.6x | 11.4x | 41.57x | 50.5x |
| BERT (unmodified) | 7.67x | 4x | 20x | 24.22x |

These ratios seem exaggerated in smaller models, but they stabilize as the model size increases (though they are still vastly different from those given by the .max function). What could be causing such high ratios in smaller models, especially when using LoRA? I'm sure it has to do with how the function measures memory; I just wanted to see if you had any insights. Thanks again!