This is not fully reproducible. You need to also share what's inside the args. Cc: @BenjaminBossan
I tried to simplify it as instructed.
If you want it to be fully reproducible, you can do the following:
build_image.sh
run_image.sh
accelerate launch trl_finetune.py -m mistralai/Mistral-7B-v0.1 -tf train.csv -vf validation.csv --block_size 4096 -e 1 --dora --pad_token_id 0 --all_linear --gradient_checkpointing -o mistral_finetune_checkpoints -b 1 --gradient_accumulation_steps 16 --log_steps 1646 --save_steps 1646 --eval_steps 1646 --warmup_steps 1317 --lora_alpha 16
So the rank here is 64, alpha is 16, dropout is 0.1, modules_to_save is None (it is only used for LongLoRA), and the target modules are: ['up_proj', 'lm_head', 'q_proj', 'gate_proj', 'o_proj', 'k_proj', 'v_proj', 'down_proj']
Those are the values inside the args and how I ran this.
No, I mean we don't know the lora_rank, target_modules, etc. This is something you can definitely provide, and we really shouldn't be expected to look for these values in other repositories.
I can give you any information you would like. Please let me know how else I can help.
model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name, token=access_token,use_flash_attention_2=True)
peft_config = LoraConfig(
task_type=TaskType.CAUSAL_LM, inference_mode=False, r=args.lora_rank, lora_alpha=args.lora_alpha, lora_dropout=args.lora_dropout,target_modules=target_modules,modules_to_save=modules_to_save,use_dora=args.dora
)
model = get_peft_model(model, peft_config)
Please revise this code with the actual values of args.lora_rank, args.lora_alpha, etc.
import torch
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, AutoConfig

access_token = None  # set this to your Hugging Face token if needed
model_name = "mistralai/Mistral-7B-v0.1"

config_kwargs = {
    "trust_remote_code": True,
}
config = AutoConfig.from_pretrained(model_name, **config_kwargs)
config.use_cache = False
config.gradient_checkpointing = True

kwargs = {"device_map": None}
bnb_config = None
target_modules = ['up_proj', 'lm_head', 'q_proj', 'gate_proj', 'o_proj', 'k_proj', 'v_proj', 'down_proj']

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=access_token,
    quantization_config=bnb_config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    config=config,
    use_flash_attention_2=True,
    **kwargs,
)
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, inference_mode=False, r=64, lora_alpha=16, lora_dropout=0.1,
    target_modules=target_modules, modules_to_save=None, use_dora=True,
)
model = get_peft_model(model, peft_config)
That should be everything relevant, I believe, outside of the things that are controlled by accelerate. Please let me know how else I might help resolve this issue.
Thanks!
Thanks for reporting. I tested a slightly modified script without flash attention 2, and this is what I got:
So yes, DoRA adds considerable overhead, but no, it should not take several extra minutes for this model. Note that initializing DoRA requires extra steps compared to LoRA, which cannot be avoided, so a certain overhead is expected (more so with quantized weights).
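Roughly speaking, on top of creating the usual LoRA A and B matrices, DoRA also has to initialize a magnitude vector for every targeted layer from a norm of the combined weight, so each target module costs about one extra pass over its full weight matrix. A conceptual sketch of that step (an illustration of the idea, not PEFT's actual code):

import torch

def dora_magnitude_init(base_weight, lora_A, lora_B, scaling):
    # base_weight: (out_features, in_features); lora_B @ lora_A has the same shape
    combined = base_weight + scaling * (lora_B @ lora_A)
    # one L2 norm per output unit seeds the DoRA magnitude vector
    return torch.linalg.norm(combined, dim=1)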
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
This is still an issue, but I think I figured out why.
Here is a slightly modified version of the program:
import torch
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, AutoConfig
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model_name", type=str, help="model name", default="mistralai/Mistral-7B-v0.1")
parser.add_argument("-cpu", "--cpu", action="store_true", help="use cpu", default=False)
parser.add_argument("-flash", "--flash", action="store_true", help="use flash", default=False)
parser.add_argument("-dora", "--dora", action="store_true", help="use dora", default=False)
args = parser.parse_args()

model_name = args.model_name
config_kwargs = {
    "trust_remote_code": True,
}
config = AutoConfig.from_pretrained(model_name, **config_kwargs)
config.use_cache = False
config.gradient_checkpointing = True

if args.cpu:
    kwargs = {"device_map": None}
else:
    kwargs = {"device_map": "auto"}

bnb_config = None
target_modules = ['up_proj', 'lm_head', 'q_proj', 'gate_proj', 'o_proj', 'k_proj', 'v_proj', 'down_proj']

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    config=config,
    attn_implementation="flash_attention_2" if args.flash else None,
    **kwargs,
)
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, inference_mode=False, r=64, lora_alpha=16, lora_dropout=0.1,
    target_modules=target_modules, modules_to_save=None, use_dora=args.dora,
)
model = get_peft_model(model, peft_config)
I called this test_dora.py.
I noticed that if I load the LoRA model on the CPU with flash, it works fine:
python test_dora.py -cpu -flash 26.91s user 21.29s system 700% cpu 6.877 total
Lora without flash on CPU was great:
python test_dora.py -cpu 26.36s user 21.33s system 701% cpu 6.799 total
Lora on GPU without flash was great:
python test_dora.py 20.73s user 4.83s system 370% cpu 6.897 total
Lora on GPU with flash was great:
python test_dora.py --flash 21.22s user 4.92s system 378% cpu 6.915 total
Dora on GPU is great:
python test_dora.py -dora 20.47s user 5.12s system 366% cpu 6.987 total
Dora on GPU with flash is great:
python test_dora.py -dora -flash 22.37s user 4.64s system 383% cpu 7.044 total
However, DoRA on CPU is where it is very slow. I am talking at least 20-100x slower to load, if it loads at all.
What I have found is that having fewer target modules makes it faster, but it is still very slow.
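To isolate where the time goes, something like the following rough timing sketch (not part of test_dora.py itself; it reuses model_name, config, kwargs, and target_modules from the script above) can be used to compare a small target module list against the full one on CPU:

import time

for modules in (["q_proj", "v_proj"], target_modules):  # small set vs. the full set
    cfg = LoraConfig(
        task_type=TaskType.CAUSAL_LM, inference_mode=False, r=64, lora_alpha=16,
        lora_dropout=0.1, target_modules=modules, use_dora=True,
    )
    # reload a fresh base model each time so get_peft_model starts from a clean state
    fresh_model = AutoModelForCausalLM.from_pretrained(
        model_name, trust_remote_code=True, torch_dtype=torch.bfloat16, config=config, **kwargs
    )
    start = time.perf_counter()
    get_peft_model(fresh_model, cfg)
    print(modules, f"{time.perf_counter() - start:.1f}s to attach the DoRA adapter")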
For my purposes, I am loading the model on the CPU and then passing it to TRL with a LoRA config to get things going in normal use. Am I not supposed to load the model on the CPU and let accelerate handle the rest?
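For reference, this is roughly the shape of that workflow, as a simplified sketch rather than my actual training script (the tiny in-memory dataset, the TrainingArguments values, and the "text" column name are placeholders):

from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer

train_dataset = Dataset.from_dict({"text": ["placeholder example"]})  # stand-in for the real CSV data

trainer = SFTTrainer(
    model=model,                  # model loaded on the CPU, before any get_peft_model call
    args=TrainingArguments(
        output_dir="mistral_finetune_checkpoints",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        gradient_checkpointing=True,
    ),
    train_dataset=train_dataset,
    peft_config=peft_config,      # the LoRA/DoRA config from above; TRL applies it internally
    dataset_text_field="text",
    max_seq_length=4096,
)
trainer.train()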
System Info
Package Version
accelerate 0.29.0.dev0
aiohttp 3.9.3
aiosignal 1.3.1
annotated-types 0.6.0
appdirs 1.4.4
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.43.0
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
datasets 2.18.0
deepspeed 0.14.0+ce78a632
dill 0.3.8
docker-pycreds 0.4.0
docstring_parser 0.16
einops 0.7.0
exceptiongroup 1.2.0
filelock 3.13.3
flash-attn 2.5.6
frozenlist 1.4.1
fsspec 2024.2.0
gitdb 4.0.11
GitPython 3.1.42
hjson 3.1.0
huggingface-hub 0.22.1
idna 3.6
iniconfig 2.0.0
Jinja2 3.1.3
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
networkx 3.1
ninja 1.11.1.1
numpy 1.24.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.4.99
nvidia-nvtx-cu12 12.1.105
packaging 24.0
pandas 2.0.3
peft 0.10.1.dev0
pillow 10.2.0
pip 24.0
pluggy 1.4.0
protobuf 3.20.1
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 15.0.2
pyarrow-hotfix 0.6
pydantic 2.6.4
pydantic_core 2.16.3
Pygments 2.17.2
pynvml 11.5.0
pytest 8.1.1
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
rich 13.7.1
safetensors 0.4.2
scipy 1.10.1
sentencepiece 0.2.0
sentry-sdk 1.43.0
setproctitle 1.3.3
setuptools 69.2.0
shtab 1.7.1
six 1.16.0
smmap 5.0.1
sympy 1.12
text-generation 0.7.0
tokenizers 0.15.2
tomli 2.0.1
torch 2.2.1
torchaudio 2.2.1
torchvision 0.17.1
tqdm 4.66.2
transformers 4.40.0.dev0
triton 2.2.0
trl 0.8.1
typing_extensions 4.10.0
tyro 0.7.3
tzdata 2024.1
urllib3 2.2.1
wandb 0.16.5
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4
python 3.11
I have tested this on both a dual A100 and a dual 3090 system, using the same Docker image.
Who can help?
@pacman100 @younesbelkada @sayakpaul
When calling the get_peft_model method with a config that has use_dora=True, the time to get a model is VERY long (several minutes). Meanwhile, if I just use a regular LoRA model, I get the model almost immediately. Oddly enough, I also do not have this issue when using a QDoRA model.

Information

Tasks

An officially supported task in the examples folder

Reproduction
I removed some stuff to keep it simple. If you want to see a more complete example of how I am running this, please see the code here.
Expected behavior
I would expect DoRA to load as quickly as LoRA, or at least not several orders of magnitude slower.