Closed mallorbc closed 3 months ago
rank0: Traceback (most recent call last):
rank0:   File "trl_finetune.py", line 401, in <module>
rank0:   File "/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py", line 361, in train
rank0:     output = super().train(*args, **kwargs)
rank0:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1859, in train
rank0:     return inner_training_loop(
rank0:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2002, in _inner_training_loop
rank0:     self.model = self.accelerator.prepare(self.model)
rank0:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1292, in prepare
rank0:     result = tuple(
rank0:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1293, in <genexpr>
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
rank0:     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
rank0:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
rank0:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
rank0:     wrapped_child, num_wrapped_params = _recursive_wrap(
rank0:   [Previous line repeated 2 more times]
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
rank0:     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
rank0:     return wrapper_cls(module, **kwargs)
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 598, in _init_param_handle_from_module
rank0:     _init_param_handle_from_params(state, managed_params, fully_sharded_module)
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 610, in _init_param_handle_from_params
rank0:     handle = FlatParamHandle(
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 582, in __init__
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 632, in _init_flat_param_and_metadata
rank0:     ) = self._validate_tensors_to_flatten(params)
rank0:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 770, in _validate_tensors_to_flatten
rank0:     raise ValueError(
rank0: ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32
Map: 100%|████████████████████████████████████████| 20201/20201 [00:01<00:00, 14172.58 examples/s]
Map: 100%|████████████████████████████████████████| 3541/3541 [00:00<00:00, 14188.14 examples/s]
/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py:318: UserWarning: You passed a tokenizer with padding_side not equal to 'right' to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding tokenizer.padding_side = 'right' to your code.
  warnings.warn(
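The one-line change the warning refers to, applied to the tokenizer before it is passed to the SFTTrainer, is simply:

tokenizer.padding_side = "right"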
The accelerate issue you mentioned sounds very similar. Do you see the same error when using Q-LoRA (i.e. without DoRA)? Could you try downgrading accelerate and checking whether that resolves the error? This info would be really useful to have. If it still breaks, but only with DoRA, it could be a DoRA+FSDP issue, possibly related to the use of nn.ParameterDict.
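A quick way to check the mixed-dtype hypothesis is to look at the parameter dtypes of a DoRA-enabled model before FSDP flattens them. The sketch below is illustrative only; the small model, target modules, and the blanket bfloat16 cast at the end are assumptions, not a recommended fix:

import torch
from collections import Counter
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Small stand-in model, purely for illustration.
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)
peft_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], use_dora=True)
model = get_peft_model(model, peft_config)

# If DoRA's magnitude vectors come out as float32 while the rest of the model
# is bfloat16, FSDP's flat-parameter validation raises the ValueError above.
print(Counter(p.dtype for p in model.parameters()))
for name, param in model.named_parameters():
    if param.dtype == torch.float32:
        print(name, param.dtype)

# Heavy-handed workaround sketch: force a uniform dtype before
# accelerator.prepare()/FSDP wrapping; verify training behaviour yourself.
for param in model.parameters():
    if param.dtype == torch.float32:
        param.data = param.data.to(torch.bfloat16)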
I have no issues using Lora or QLora with FSDP when I install certain versions of the software stack. Naively installing everything from the latest release will not work at this time. With the software versions I listed above, both the sample script I provide and a more complex training program work.
I can try downgrading accelerate to 0.29.3 later (when my training with QLora FSDP is finished).
I have tried PEFT from the main branch with the latest release of everything else. This allowed me to train FSDP with Lora/QLora.
Another combination that worked is the latest released version of PEFT with accelerate 0.29.3. Using the main branch install of PEFT did not fix that, as you can see in the other issue.
So the options to get FSDP QLora working are:
PEFT main with everything else at the latest release
accelerate<=0.29.3 with everything else at the latest release
What I will try: accelerate<=0.29.3 with PEFT main installed and the latest for everything else.
I will share what I find once my system is idle and I can test these things.
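For reference, the two working combinations above boil down to either installing PEFT from source with pip install git+https://github.com/huggingface/peft.git (everything else at the latest release), or pinning accelerate with pip install "accelerate<=0.29.3" while keeping the released PEFT. These are just the standard pip commands for those versions, not something tested beyond what is described in this thread.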
Update: DoRA and QDoRA training with FSDP should be fixed in #1806. If you install from the latest PEFT main, it should thus work. Please also check the PR description for how this was tested.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
System Info
Package Version
accelerate 0.30.1
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.6.0
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.43.1
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
datasets 2.19.1
deepspeed 0.14.2+5f631abc
dill 0.3.8
docker-pycreds 0.4.0
docstring_parser 0.16
einops 0.8.0
eval_type_backport 0.2.0
exceptiongroup 1.2.1
filelock 3.14.0
flash-attn 2.5.8
frozenlist 1.4.1
fsspec 2024.3.1
gitdb 4.0.11
GitPython 3.1.43
hf_transfer 0.1.6
hjson 3.1.0
huggingface-hub 0.23.0
idna 3.7
iniconfig 2.0.0
Jinja2 3.1.4
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
networkx 3.1
ninja 1.11.1.1
numpy 1.24.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
packaging 24.0
pandas 2.0.3
peft 0.11.1.dev0
pillow 10.3.0
pip 24.0
platformdirs 4.2.2
pluggy 1.5.0
protobuf 3.20.1
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 16.1.0
pyarrow-hotfix 0.6
pydantic 2.7.1
pydantic_core 2.18.2
Pygments 2.18.0
pynvml 11.5.0
pytest 8.2.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2024.5.15
requests 2.31.0
rich 13.7.1
safetensors 0.4.3
scipy 1.10.1
sentencepiece 0.2.0
sentry-sdk 2.2.0
setproctitle 1.3.3
setuptools 69.5.1
shtab 1.7.1
six 1.16.0
smmap 5.0.1
sympy 1.12
text-generation 0.7.0
tokenizers 0.19.1
tomli 2.0.1
torch 2.3.0
torchaudio 2.3.0
torchvision 0.18.0
tqdm 4.66.4
transformers 4.40.2
triton 2.3.0
trl 0.8.6
typing_extensions 4.11.0
tyro 0.8.4
tzdata 2024.1
urllib3 2.2.1
wandb 0.17.0
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4
I am using two RTX 3090s with Ubuntu 12.2.2 inside of a docker container.
Regular Lora/QLora with FSDP works.
I'm not sure where this should go; either PEFT or accelerate, I would guess.
I feel like this issue might be related to the following: https://github.com/huggingface/peft/issues/1674 https://github.com/huggingface/accelerate/issues/2761 https://github.com/huggingface/peft/issues/1593#issuecomment-2116202685
Who can help?
@pacman100 @younesbelkada @BenjaminBossan
Information
Tasks
examples folder
Reproduction
Both DDP and FSDP work with regular Lora/QLora
Scripts
Working Dora DDP config
Broken Dora FSDP config
Simple Program To Test
You can use this program to see how it is broken. Running on CPU with regular Dora will be much slower, but it will still work.
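The attached test_dora.py is not reproduced here; as a rough, hypothetical sketch of a script in the same spirit (model name, flags, and the dummy training loop are assumptions, and the accelerate/FSDP launch and quantization options are omitted), it might look roughly like this:

import argparse
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

parser = argparse.ArgumentParser()
parser.add_argument("-dora", action="store_true")  # toggle DoRA vs plain LoRA
parser.add_argument("-cpu", action="store_true")   # force CPU execution
args = parser.parse_args()

device = "cpu" if args.cpu or not torch.cuda.is_available() else "cuda"

# Hypothetical small model; the real script presumably loads a larger one.
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16).to(device)
peft_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], use_dora=args.dora)
model = get_peft_model(model, peft_config)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
batch = tokenizer(["hello world"] * 4, return_tensors="pt").to(device)

start = time.time()
for _ in range(10):  # a few dummy forward/backward steps, no optimizer
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    model.zero_grad()
print(f"{time.time() - start:.2f} seconds")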
Example Uses And Current Results
FSDP Lora
time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py -accelerate -flash
11.85 seconds

FSDP QLora
time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py -accelerate -flash -int4
16.86 seconds

DDP Lora
time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py -accelerate -flash
12.84 seconds

DDP QLora
time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py -accelerate -flash -int4
12.85 seconds

FSDP Dora
time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py -dora -accelerate -flash
killed after waiting 5+ minutes

FSDP QDora
time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py -dora -accelerate -flash -int4
killed after waiting 5+ minutes

DDP Dora
time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py -dora -accelerate -flash
12.83 seconds

DDP QDora
time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py -dora -accelerate -flash -int4
12.85 seconds

Regular Lora
time python test_dora.py -flash
6.92 seconds

Regular Dora
time python test_dora.py -flash -dora
6.99 seconds

Regular QLora
time python test_dora.py -flash -int4
7.45 seconds

Regular QDora
time python test_dora.py -flash -dora -int4
7.52 seconds

Regular Lora CPU
time python test_dora.py -flash -cpu
6.886 seconds

Regular QLora CPU
time python test_dora.py -flash -cpu -int4
7.16 seconds

Regular Dora CPU
time python test_dora.py -flash -cpu -dora
killed after 10+ minutes, but I have gotten this working before (or at least I am pretty sure)

Regular QDora CPU
time python test_dora.py -flash -cpu -dora --int4
7.10 seconds

Expected behavior
I would expect the same behavior as with regular Lora/QLora, meaning that training occurs successfully and the sample script runs.