This is not fully reproducible. You need to also share what's inside the args. Cc: @BenjaminBossan
I tried to simplify it as instructed.
If you want it to be fully reproducible, you can do the following:
build_image.sh
run_image.sh
accelerate launch trl_finetune.py -m mistralai/Mistral-7B-v0.1 -tf train.csv -vf validation.csv --block_size 4096 -e 1 --dora --pad_token_id 0 --all_linear --gradient_checkpointing -o mistral_finetune_checkpoints -b 1 --gradient_accumulation_steps 16 --log_steps 1646 --save_steps 1646 --eval_steps 1646 --warmup_steps 1317 --lora_alpha 16
So the rank here is 64, alpha is 16, dropout is 0.1, modules_to_save is None (it is only used for LongLoRA), and the target modules are: ['up_proj', 'lm_head', 'q_proj', 'gate_proj', 'o_proj', 'k_proj', 'v_proj', 'down_proj']
Those are the values inside the args and how I ran this.
No, I mean we don't know the lora_rank, target_modules, etc. This is something you can definitely provide, and we really shouldn't be expected to look for these values in other repositories.
I can give you any information you would like. Please let me know how else I can help.
model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name, token=access_token,use_flash_attention_2=True)
peft_config = LoraConfig(
task_type=TaskType.CAUSAL_LM, inference_mode=False, r=args.lora_rank, lora_alpha=args.lora_alpha, lora_dropout=args.lora_dropout,target_modules=target_modules,modules_to_save=modules_to_save,use_dora=args.dora
)
model = get_peft_model(model, peft_config)
Please revise this code with the actual values of args.lora_rank, args.lora_alpha, etc.
import torch
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, AutoConfig

access_token = None  # set this to your Hugging Face token if needed
model_name = "mistralai/Mistral-7B-v0.1"

config_kwargs = {
    "trust_remote_code": True,
}
config = AutoConfig.from_pretrained(model_name, **config_kwargs)
config.use_cache = False
config.gradient_checkpointing = True

kwargs = {"device_map": None}
bnb_config = None
target_modules = ['up_proj', 'lm_head', 'q_proj', 'gate_proj', 'o_proj', 'k_proj', 'v_proj', 'down_proj']

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=access_token,
    quantization_config=bnb_config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    config=config,
    use_flash_attention_2=True,
    **kwargs,
)
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, inference_mode=False, r=64, lora_alpha=16, lora_dropout=0.1,
    target_modules=target_modules, modules_to_save=None, use_dora=True,
)
model = get_peft_model(model, peft_config)
That should be everything relevant, I believe, outside of the things that are controlled by accelerate. Please let me know how else I might help resolve this issue.
Thanks!
Thanks for reporting. I tested a slightly modified script without flash attention 2, and this is what I got:
So yes, DoRA adds considerable overhead, but no, it should not take several extra minutes for this model. Note that initializing DoRA requires extra steps compared to LoRA, which cannot be avoided, so a certain overhead is expected (more so with quantized weights).
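Roughly speaking, on top of creating the usual LoRA A and B matrices, DoRA also has to initialize a magnitude vector for every targeted layer from a norm of the combined weight, so each target module costs about one extra pass over its full weight matrix. A conceptual sketch of that step (an illustration of the idea, not PEFT's actual code):

import torch

def dora_magnitude_init(base_weight, lora_A, lora_B, scaling):
    # base_weight: (out_features, in_features); lora_B @ lora_A has the same shape
    combined = base_weight + scaling * (lora_B @ lora_A)
    # one L2 norm per output unit seeds the DoRA magnitude vector
    return torch.linalg.norm(combined, dim=1)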
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
This is still an issue, but I think I figured out why.
Here is a slightly modified version of the program:
import torch
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, AutoConfig
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model_name", type=str, help="model name", default="mistralai/Mistral-7B-v0.1")
parser.add_argument("-cpu", "--cpu", action="store_true", help="use cpu", default=False)
parser.add_argument("-flash", "--flash", action="store_true", help="use flash", default=False)
parser.add_argument("-dora", "--dora", action="store_true", help="use dora", default=False)
args = parser.parse_args()

model_name = args.model_name
config_kwargs = {
    "trust_remote_code": True,
}
config = AutoConfig.from_pretrained(model_name, **config_kwargs)
config.use_cache = False
config.gradient_checkpointing = True

if args.cpu:
    kwargs = {"device_map": None}
else:
    kwargs = {"device_map": "auto"}

bnb_config = None
target_modules = ['up_proj', 'lm_head', 'q_proj', 'gate_proj', 'o_proj', 'k_proj', 'v_proj', 'down_proj']

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    config=config,
    attn_implementation="flash_attention_2" if args.flash else None,
    **kwargs,
)
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, inference_mode=False, r=64, lora_alpha=16, lora_dropout=0.1,
    target_modules=target_modules, modules_to_save=None, use_dora=args.dora,
)
model = get_peft_model(model, peft_config)
I called this test_dora.py.
I noticed that if I load the LoRA model on the CPU with flash, it works fine:
python test_dora.py -cpu -flash 26.91s user 21.29s system 700% cpu 6.877 total
Lora without flash on CPU was great:
python test_dora.py -cpu 26.36s user 21.33s system 701% cpu 6.799 total
Lora on GPU without flash was great:
python test_dora.py 20.73s user 4.83s system 370% cpu 6.897 total
Lora on GPU with flash was great:
python test_dora.py --flash 21.22s user 4.92s system 378% cpu 6.915 total
Dora on GPU is great:
python test_dora.py -dora 20.47s user 5.12s system 366% cpu 6.987 total
Dora on GPU with flash is great:
python test_dora.py -dora -flash 22.37s user 4.64s system 383% cpu 7.044 total
However, DoRA on CPU is where it is very slow. I am talking at least 20-100x slower to load, if it loads at all.
What I have found is that having fewer target modules makes it faster, but it is still very slow.
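To isolate where the time goes, something like the following rough timing sketch (not part of test_dora.py itself; it reuses model_name, config, kwargs, and target_modules from the script above) can be used to compare a small target module list against the full one on CPU:

import time

for modules in (["q_proj", "v_proj"], target_modules):  # small set vs. the full set
    cfg = LoraConfig(
        task_type=TaskType.CAUSAL_LM, inference_mode=False, r=64, lora_alpha=16,
        lora_dropout=0.1, target_modules=modules, use_dora=True,
    )
    # reload a fresh base model each time so get_peft_model starts from a clean state
    fresh_model = AutoModelForCausalLM.from_pretrained(
        model_name, trust_remote_code=True, torch_dtype=torch.bfloat16, config=config, **kwargs
    )
    start = time.perf_counter()
    get_peft_model(fresh_model, cfg)
    print(modules, f"{time.perf_counter() - start:.1f}s to attach the DoRA adapter")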
For my purposes, I am loading the model on the CPU and then passing it to TRL with a LoRA config to get things going in normal use. Am I not supposed to load the model on the CPU and let accelerate handle the rest?
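For reference, this is roughly the shape of that workflow, as a simplified sketch rather than my actual training script (the tiny in-memory dataset, the TrainingArguments values, and the "text" column name are placeholders):

from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer

train_dataset = Dataset.from_dict({"text": ["placeholder example"]})  # stand-in for the real CSV data

trainer = SFTTrainer(
    model=model,                  # model loaded on the CPU, before any get_peft_model call
    args=TrainingArguments(
        output_dir="mistral_finetune_checkpoints",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        gradient_checkpointing=True,
    ),
    train_dataset=train_dataset,
    peft_config=peft_config,      # the LoRA/DoRA config from above; TRL applies it internally
    dataset_text_field="text",
    max_seq_length=4096,
)
trainer.train()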
System Info
Package Version
accelerate 0.29.0.dev0
aiohttp 3.9.3
aiosignal 1.3.1
annotated-types 0.6.0
appdirs 1.4.4
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.43.0
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
datasets 2.18.0
deepspeed 0.14.0+ce78a632
dill 0.3.8
docker-pycreds 0.4.0
docstring_parser 0.16
einops 0.7.0
exceptiongroup 1.2.0
filelock 3.13.3
flash-attn 2.5.6
frozenlist 1.4.1
fsspec 2024.2.0
gitdb 4.0.11
GitPython 3.1.42
hjson 3.1.0
huggingface-hub 0.22.1
idna 3.6
iniconfig 2.0.0
Jinja2 3.1.3
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
networkx 3.1
ninja 1.11.1.1
numpy 1.24.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.4.99
nvidia-nvtx-cu12 12.1.105
packaging 24.0
pandas 2.0.3
peft 0.10.1.dev0
pillow 10.2.0
pip 24.0
pluggy 1.4.0
protobuf 3.20.1
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 15.0.2
pyarrow-hotfix 0.6
pydantic 2.6.4
pydantic_core 2.16.3
Pygments 2.17.2
pynvml 11.5.0
pytest 8.1.1
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
rich 13.7.1
safetensors 0.4.2
scipy 1.10.1
sentencepiece 0.2.0
sentry-sdk 1.43.0
setproctitle 1.3.3
setuptools 69.2.0
shtab 1.7.1
six 1.16.0
smmap 5.0.1
sympy 1.12
text-generation 0.7.0
tokenizers 0.15.2
tomli 2.0.1
torch 2.2.1
torchaudio 2.2.1
torchvision 0.17.1
tqdm 4.66.2
transformers 4.40.0.dev0
triton 2.2.0
trl 0.8.1
typing_extensions 4.10.0
tyro 0.7.3
tzdata 2024.1
urllib3 2.2.1
wandb 0.16.5
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4
python 3.11
I have tested this on both a dual A100 and a dual 3090 system, using the same Docker image.
Who can help?
@pacman100 @younesbelkada @sayakpaul
When calling the get_peft_model method with a config that has use_dora=True, the time to get a model is VERY long (several minutes). Meanwhile, if I just use a regular LoRA model, I get the model almost immediately. Oddly enough, I also do not have this issue when using a QDoRA model.

Information

Tasks

An officially supported task in the examples folder

Reproduction
I removed some stuff to keep it simple. If you want to see a more complete example of how I am running this, please see the code here.
Expected behavior
I would expect DoRA to load as quickly as LoRA, or at least not several orders of magnitude slower.