cc @younesbelkada @pacman100
@mallorbc Could you try installing PEFT from main and check if the error persists?
So use latest accelerate and install peft from main?
I will do the following:

```
pip install transformers bitsandbytes trl accelerate
pip install git+https://github.com/huggingface/peft.git
```
I will let you know
I did the above setup. Here is my pip list:

```
Package                  Version
accelerate               0.30.1
aiohttp                  3.9.5
aiosignal                1.3.1
annotated-types          0.6.0
async-timeout            4.0.3
attrs                    23.2.0
bitsandbytes             0.43.1
certifi                  2024.2.2
charset-normalizer       3.3.2
click                    8.1.7
datasets                 2.19.1
deepspeed                0.14.2+5f631abc
dill                     0.3.8
docker-pycreds           0.4.0
docstring_parser         0.16
einops                   0.8.0
eval_type_backport       0.2.0
exceptiongroup           1.2.1
filelock                 3.14.0
flash-attn               2.5.8
frozenlist               1.4.1
fsspec                   2024.3.1
gitdb                    4.0.11
GitPython                3.1.43
hf_transfer              0.1.6
hjson                    3.1.0
huggingface-hub          0.23.0
idna                     3.7
iniconfig                2.0.0
Jinja2                   3.1.4
markdown-it-py           3.0.0
MarkupSafe               2.1.5
mdurl                    0.1.2
mpmath                   1.3.0
multidict                6.0.5
multiprocess             0.70.16
networkx                 3.1
ninja                    1.11.1.1
numpy                    1.24.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.1.105
packaging                24.0
pandas                   2.0.3
peft                     0.11.1.dev0
pillow                   10.3.0
pip                      24.0
platformdirs             4.2.2
pluggy                   1.5.0
protobuf                 3.20.1
psutil                   5.9.8
py-cpuinfo               9.0.0
pyarrow                  16.1.0
pyarrow-hotfix           0.6
pydantic                 2.7.1
pydantic_core            2.18.2
Pygments                 2.18.0
pynvml                   11.5.0
pytest                   8.2.0
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.1
regex                    2024.5.15
requests                 2.31.0
rich                     13.7.1
safetensors              0.4.3
scipy                    1.10.1
sentencepiece            0.2.0
sentry-sdk               2.2.0
setproctitle             1.3.3
setuptools               69.5.1
shtab                    1.7.1
six                      1.16.0
smmap                    5.0.1
sympy                    1.12
text-generation          0.7.0
tokenizers               0.19.1
tomli                    2.0.1
torch                    2.3.0
torchaudio               2.3.0
torchvision              0.18.0
tqdm                     4.66.4
transformers             4.40.2
triton                   2.3.0
trl                      0.8.6
typing_extensions        4.11.0
tyro                     0.8.4
tzdata                   2024.1
urllib3                  2.2.1
wandb                    0.17.0
wheel                    0.43.0
xxhash                   3.4.1
yarl                     1.9.4
```
I can confirm that this led to successful fine-tuning with QLoRA with FSDP. However, QDoRA seems to be broken.
When I try FSDP QDoRA, I get the following issue:
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "trl_finetune.py", line 399, in <module>
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py", line 361, in train
[rank0]:     output = super().train(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1859, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2002, in _inner_training_loop
[rank0]:     self.model = self.accelerator.prepare(self.model)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1292, in prepare
[rank0]:     result = tuple(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
[rank0]:     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:   [Previous line repeated 2 more times]
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
[rank0]:     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
[rank0]:     return wrapper_cls(module, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 598, in _init_param_handle_from_module
[rank0]:     _init_param_handle_from_params(state, managed_params, fully_sharded_module)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 610, in _init_param_handle_from_params
[rank0]:     handle = FlatParamHandle(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 582, in __init__
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 632, in _init_flat_param_and_metadata
[rank0]:     ) = self._validate_tensors_to_flatten(params)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 768, in _validate_tensors_to_flatten
[rank0]:     raise ValueError("Cannot flatten integer dtype tensors")
[rank0]: ValueError: Cannot flatten integer dtype tensors
```
I used exactly the versions you mentioned, and with FSDP + QLoRA I got the same "ValueError: Cannot flatten integer dtype tensors".
For QLoRA training with FSDP, please check the updated bitsandbytes docs.
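In case it helps: the error happens because the 4-bit quantized weights are stored as integer tensors, which FSDP cannot flatten. The fix described in those docs is to store them in a float dtype instead. A minimal sketch, assuming a recent transformers/bitsandbytes (the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Store the 4-bit weights in a float dtype so FSDP can flatten them;
    # without this, FSDP raises "Cannot flatten integer dtype tensors".
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,  # should match bnb_4bit_quant_storage for FSDP
)
```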
As for QDoRA: Training with FSDP should be fixed in https://github.com/huggingface/peft/pull/1806. If you install from the latest PEFT main, it should thus work. Please also check the PR description on how this was tested.
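For reference, the only change needed on top of a working QLoRA setup to try QDoRA should be the DoRA flag on the adapter config. A minimal sketch (rank, alpha, and target modules are placeholders):

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # placeholder target modules
    use_dora=True,  # DoRA on top of the 4-bit base model, i.e. QDoRA
)
```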
System Info

Information

Tasks

One of the `no_trainer` scripts in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
I am using code based on this repo: https://github.com/mallorbc/Finetune_LLMs
Otherwise, the basic steps are the following:
With accelerate 0.30.0, I see an error like the following:
Expected behavior
I expect training to occur without issues, as it does when I use accelerate 0.29.3.