huggingface / peft

πŸ€— PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

FSDP Dora/QDora Broken #1737

Closed mallorbc closed 3 months ago

mallorbc commented 4 months ago

System Info

Package Version


accelerate 0.30.1
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.6.0
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.43.1
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
datasets 2.19.1
deepspeed 0.14.2+5f631abc
dill 0.3.8
docker-pycreds 0.4.0
docstring_parser 0.16
einops 0.8.0
eval_type_backport 0.2.0
exceptiongroup 1.2.1
filelock 3.14.0
flash-attn 2.5.8
frozenlist 1.4.1
fsspec 2024.3.1
gitdb 4.0.11
GitPython 3.1.43
hf_transfer 0.1.6
hjson 3.1.0
huggingface-hub 0.23.0
idna 3.7
iniconfig 2.0.0
Jinja2 3.1.4
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
networkx 3.1
ninja 1.11.1.1
numpy 1.24.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
packaging 24.0
pandas 2.0.3
peft 0.11.1.dev0
pillow 10.3.0
pip 24.0
platformdirs 4.2.2
pluggy 1.5.0
protobuf 3.20.1
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 16.1.0
pyarrow-hotfix 0.6
pydantic 2.7.1
pydantic_core 2.18.2
Pygments 2.18.0
pynvml 11.5.0
pytest 8.2.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2024.5.15
requests 2.31.0
rich 13.7.1
safetensors 0.4.3
scipy 1.10.1
sentencepiece 0.2.0
sentry-sdk 2.2.0
setproctitle 1.3.3
setuptools 69.5.1
shtab 1.7.1
six 1.16.0
smmap 5.0.1
sympy 1.12
text-generation 0.7.0
tokenizers 0.19.1
tomli 2.0.1
torch 2.3.0
torchaudio 2.3.0
torchvision 0.18.0
tqdm 4.66.4
transformers 4.40.2
triton 2.3.0
trl 0.8.6
typing_extensions 4.11.0
tyro 0.8.4
tzdata 2024.1
urllib3 2.2.1
wandb 0.17.0
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4

I am using two RTX 3090s with Ubuntu 12.2.2 inside a Docker container.

Regular Lora/QLora with FSDP works.

I am not sure where this should go; I would guess either PEFT or accelerate.

I feel like this issue might be related to the following:

https://github.com/huggingface/peft/issues/1674
https://github.com/huggingface/accelerate/issues/2761
https://github.com/huggingface/peft/issues/1593#issuecomment-2116202685

Who can help?

@pacman100 @younesbelkada @BenjaminBossan

Information

Tasks

Reproduction

  1. Install the requirements that I have. They are all the latest releases except for PEFT, which is installed from the main branch because of another recent PR that fixed QLora.
  2. Try using Dora or QDora. You will hit one of the two kinds of errors I have found.
  3. One error is that the Dora model never appears and the run times out (or you kill it after waiting 10+ minutes).
  4. The other error is much longer; the logs are below.

Both DDP and FSDP work with regular Lora/QLora.
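For reference, the PEFT main-branch install mentioned in step 1 is just the usual git install (illustrative command, not necessarily the exact one I ran):

pip install git+https://github.com/huggingface/peft.git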

Scripts

Working Dora DDP config

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  zero3_init_flag: false
  zero_stage: 0
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Broken Dora FSDP config

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Simple Program To Test

You can use this program to see how it is broken. Running on CPU with regular Dora will be much slower, but it will still work.

import argparse

import torch
from accelerate import Accelerator
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, AutoConfig

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model_name", type=str, help="model name", default="mistralai/Mistral-7B-v0.1")
parser.add_argument("-cpu", "--cpu", action="store_true", help="use cpu", default=False)
parser.add_argument("-flash", "--flash", action="store_true", help="use flash attention 2", default=False)
parser.add_argument("-dora", "--dora", action="store_true", help="use dora", default=False)
parser.add_argument("-int4", "--int4", action="store_true", help="use int4 (QLora/QDora)", default=False)
parser.add_argument("-accelerate", "--accelerate", action="store_true", help="use accelerate", default=False)
args = parser.parse_args()
model_name = args.model_name

config_kwargs = {
    "trust_remote_code": True,
}
config = AutoConfig.from_pretrained(model_name, **config_kwargs)
config.use_cache = False
config.gradient_checkpointing = True

# Pick the device map: None for CPU, one device per process under accelerate,
# otherwise let transformers place the model automatically.
if args.cpu:
    kwargs = {"device_map": None}
elif args.accelerate:
    kwargs = {}
    device_index = Accelerator().process_index
    device_map = {"": device_index}
    kwargs["device_map"] = device_map
else:
    kwargs = {"device_map": "auto"}

# Optional 4-bit quantization (QLora/QDora).
if not args.int4:
    bnb_config = None
else:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_storage=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32,
    )

target_modules = ["up_proj", "lm_head", "q_proj", "gate_proj", "o_proj", "k_proj", "v_proj", "down_proj"]
torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    torch_dtype=torch_dtype,
    config=config,
    attn_implementation="flash_attention_2" if args.flash else None,
    **kwargs,
)

# LoRA config; use_dora toggles Dora/QDora.
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=target_modules,
    modules_to_save=None,
    use_dora=args.dora,
)
model = get_peft_model(model, peft_config)

Example Uses And Current Results

FSDP Lora: time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py -accelerate -flash (11.85 seconds)
FSDP QLora: time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py -accelerate -flash -int4 (16.86 seconds)
DDP Lora: time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py -accelerate -flash (12.84 seconds)
DDP QLora: time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py -accelerate -flash -int4 (12.85 seconds)

FSDP Dora: time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py -dora -accelerate -flash (killed after waiting 5+ minutes)
FSDP QDora: time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py -dora -accelerate -flash -int4 (killed after waiting 5+ minutes)
DDP Dora: time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py -dora -accelerate -flash (12.83 seconds)
DDP QDora: time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py -dora -accelerate -flash -int4 (12.85 seconds)

Regular Lora: time python test_dora.py -flash (6.92 seconds)
Regular Dora: time python test_dora.py -flash -dora (6.99 seconds)
Regular QLora: time python test_dora.py -flash -int4 (7.45 seconds)
Regular QDora: time python test_dora.py -flash -dora -int4 (7.52 seconds)
Regular Lora CPU: time python test_dora.py -flash -cpu (6.886 seconds)
Regular QLora CPU: time python test_dora.py -flash -cpu -int4 (7.16 seconds)
Regular Dora CPU: time python test_dora.py -flash -cpu -dora (killed after 10+ minutes, though I am fairly sure I have gotten this working before)
Regular QDora CPU: time python test_dora.py -flash -cpu -dora --int4 (7.10 seconds)

Expected behavior

I would expect the same behavior as regular Lora/QLora, meaning that training completes successfully and the sample script runs.

mallorbc commented 4 months ago

rank0: Traceback (most recent call last):
rank0:   File "trl_finetune.py", line 401, in <module>
rank0:   File "/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py", line 361, in train
rank0:     output = super().train(*args, **kwargs)
rank0:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1859, in train
rank0:     return inner_training_loop(
rank0:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2002, in _inner_training_loop
rank0:     self.model = self.accelerator.prepare(self.model)
rank0:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1292, in prepare
rank0:     result = tuple(
rank0:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1293, in <genexpr>
rank0:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
rank0:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1169, in _prepare_one
rank0:     return self.prepare_model(obj, device_placement=device_placement)
rank0:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1459, in prepare_model
rank0:     model = FSDP(model, **kwargs)
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 485, in __init__
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
rank0:     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
rank0:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
rank0:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
rank0:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank0:   [Previous line repeated 2 more times]
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
rank0:     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
rank0:     return wrapper_cls(module, **kwargs)
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 598, in _init_param_handle_from_module
rank0:     _init_param_handle_from_params(state, managed_params, fully_sharded_module)
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 610, in _init_param_handle_from_params
rank0:     handle = FlatParamHandle(
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 582, in __init__
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 632, in _init_flat_param_and_metadata
rank0:     ) = self._validate_tensors_to_flatten(params)
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 770, in _validate_tensors_to_flatten
rank0:     raise ValueError(
rank0: ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32

Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 20201/20201 [00:01<00:00, 14172.58 examples/s]
Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3541/3541 [00:00<00:00, 14188.14 examples/s]
/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py:318: UserWarning: You passed a tokenizer with padding_side not equal to right to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding tokenizer.padding_side = 'right' to your code. warnings.warn(

[rank1 raises an identical traceback, ending in the same error:]
rank1: ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32

BenjaminBossan commented 4 months ago

The accelerate issue you mentioned sounds very similar. Do you see the same error when using Q-LoRA (i.e. without DoRA)? Could you try downgrading accelerate and check if this resolves the error?

This info would be really useful to have. If it still breaks, but only with DoRA, it could be a DoRA+FSDP issue, possibly related to the use of nn.ParameterDict.
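If it helps, here is a rough, untested sketch for checking which parameters end up with a different dtype right before FSDP wraps the model. It assumes model is the PEFT model returned by get_peft_model in your test script; the DoRA magnitude parameters should show up as lora_magnitude_vector entries.

from collections import defaultdict

# Untested diagnostic sketch: group parameter names by dtype so a
# bfloat16/float32 mix is easy to spot before FSDP tries to flatten them.
# `model` is assumed to be the PEFT model returned by get_peft_model.
dtypes = defaultdict(list)
for name, param in model.named_parameters():
    dtypes[param.dtype].append(name)

for dtype, names in dtypes.items():
    # Print the count and a few example parameter names for each dtype.
    print(dtype, len(names), names[:3])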

mallorbc commented 4 months ago

I have no issues using Lora or QLora with FSDP when I install certain versions of the software stack; naively installing the latest release of everything does not work at this time. With the software versions I listed above, both the sample script I provided and a more complex training program work for these combinations.

I can try downgrading accelerate to 0.29.3 later (when my training with QLora FSDP is finished).

I have tried PEFT from the main branch with the latest release of everything else. This allowed me to train FSDP with Lora/QLora.

Another combination that worked is using the latest released version of PEFT with accelerate 0.29.3. Using the main branch install of PEFT did not fix that as you can see in the other issue.

So the options to get FSDP QLora working are:

  - PEFT main, with the latest release of everything else
  - accelerate<=0.29.3, with the latest release of everything else

What I will try: accelerate<=0.29.3 with PEFT main installed and the latest for everything else.
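For reference, that would mean pinning accelerate on top of the PEFT main install shown earlier, e.g. (illustrative, not exactly what I will run):

pip install "accelerate<=0.29.3"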

I will share what I find when my system is idle to test these things.

BenjaminBossan commented 4 months ago

Update: DoRA and QDoRA training with FSDP should be fixed in #1806. If you install from the latest PEFT main, it should thus work. Please also check the PR description on how this was tested.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.