Closed by cai-rishabh 6 months ago
Hi @cai-rishabh, can you share the full traceback of the error?
Hi @younesbelkada, here is the full traceback:
0%| | 0/58 [00:00<?, ?it/s]Traceback (most recent call last):
File "/data/rish/finetuning/cai-llm-finetuning/test_whole copy.py", line 189, in
Hi @younesbelkada, could you please confirm whether this is a bug? I need to know as soon as possible, in case resolving it takes some time.
Could you please update PEFT to the main version installed from source and check if the error persists?
Hi @BenjaminBossan, thank you for your help. It is working now!
Earlier PEFT version: 0.8.2
Current PEFT version: 0.9.1.dev0
> Could you please update PEFT to the main version installed from source and check if the error persists?

I am using version 0.10.0 of PEFT, but the error still exists. My traceback:
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/kdx/soft/anaconda3/envs/sighan/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/data/kdx/soft/anaconda3/envs/sighan/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/data/kdx/soft/anaconda3/envs/sighan/lib/python3.10/site-packages/debugpy/__main__.py", line 39, in <module>
[rank0]: cli.main()
[rank0]: File "/data/kdx/soft/anaconda3/envs/sighan/lib/python3.10/site-packages/debugpy/server/cli.py", line 430, in main
[rank0]: run()
[rank0]: File "/data/kdx/soft/anaconda3/envs/sighan/lib/python3.10/site-packages/debugpy/server/cli.py", line 284, in run_file
[rank0]: runpy.run_path(target, run_name="__main__")
[rank0]: File "/data/kdx/soft/anaconda3/envs/sighan/lib/python3.10/site-packages/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
[rank0]: return _run_module_code(code, init_globals, run_name,
[rank0]: File "/data/kdx/soft/anaconda3/envs/sighan/lib/python3.10/site-packages/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
[rank0]: _run_code(code, mod_globals, init_globals,
[rank0]: File "/data/kdx/soft/anaconda3/envs/sighan/lib/python3.10/site-packages/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "main.py", line 181, in <module>
[rank0]: trainer.train(resume_from_checkpoint=configs.resume_from_checkpoint)
[rank0]: File "/data/kdx/soft/anaconda3/envs/sighan/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
[rank0]: return inner_training_loop(
[rank0]: File "/data/kdx/soft/anaconda3/envs/sighan/lib/python3.10/site-packages/transformers/trainer.py", line 2249, in _inner_training_loop
[rank0]: _grad_norm = self.accelerator.clip_grad_norm_(
[rank0]: File "/data/kdx/soft/anaconda3/envs/sighan/lib/python3.10/site-packages/accelerate/accelerator.py", line 2157, in clip_grad_norm_
[rank0]: self.unscale_gradients()
[rank0]: File "/data/kdx/soft/anaconda3/envs/sighan/lib/python3.10/site-packages/accelerate/accelerator.py", line 2107, in unscale_gradients
[rank0]: self.scaler.unscale_(opt)
[rank0]: File "/data/kdx/soft/anaconda3/envs/sighan/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 337, in unscale_
[rank0]: optimizer_state["found_inf_per_device"] = self._unscale_grads_(
[rank0]: File "/data/kdx/soft/anaconda3/envs/sighan/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 259, in _unscale_grads_
[rank0]: raise ValueError("Attempting to unscale FP16 gradients.")
[rank0]: ValueError: Attempting to unscale FP16 gradients.
What you report seems to be a different error. You appear to be using float16 weights, which is what causes it. Please try casting the fp16 weights to fp32 as described in the docs.
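For reference, the fix referred to here boils down to upcasting every trainable parameter to fp32 before training, so that the AMP GradScaler never has to unscale fp16 gradients. A minimal sketch using a stand-in module (the real code would iterate over the PEFT-wrapped model instead):

```python
import torch
import torch.nn as nn

# Stand-in for a model whose weights were loaded in float16.
model = nn.Linear(8, 8).half()

# Upcast only the trainable parameters to float32; frozen (e.g. quantized
# or fp16 base) weights can stay as they are. This is what avoids
# "ValueError: Attempting to unscale FP16 gradients." under AMP.
for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.float()

print(all(p.dtype == torch.float32 for p in model.parameters()))  # True
```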
System Info
------- nvidia-smi output ----------
GPU: Tesla T4
NVIDIA-SMI: 545.23.08
Driver Version: 545.23.08
CUDA Version: 12.3
-------- hostnamectl output --------
OS: Ubuntu 20.04.6 LTS
Kernel: Linux 5.15.0-1048-aws
Architecture: x86-64
--------- pip freeze output ----------
accelerate==0.27.2 aiohttp==3.9.3 aiosignal==1.3.1 asttokens==2.4.1 async-timeout==4.0.3 attrs==23.2.0 backcall==0.2.0 bitsandbytes==0.42.0 certifi==2024.2.2 charset-normalizer==3.3.2 comm==0.2.1 datasets==2.17.1 debugpy==1.8.1 decorator==5.1.1 dill==0.3.8 evaluate==0.4.1 executing==2.0.1 filelock==3.13.1 frozenlist==1.4.1 fsspec==2023.10.0 huggingface-hub==0.20.3 idna==3.6 importlib-metadata==7.0.1 ipykernel==6.29.3 ipython==8.12.3 jedi==0.19.1 Jinja2==3.1.3 joblib==1.3.2 jupyter-client==8.6.0 jupyter-core==5.7.1 MarkupSafe==2.1.5 matplotlib-inline==0.1.6 mpmath==1.3.0 multidict==6.0.5 multiprocess==0.70.16 nest-asyncio==1.6.0 networkx==3.1 numpy==1.24.4 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-nccl-cu12==2.19.3 nvidia-nvjitlink-cu12==12.3.101 nvidia-nvtx-cu12==12.1.105 packaging==23.2 pandas==2.0.3 parso==0.8.3 peft==0.8.2 pexpect==4.9.0 pickleshare==0.7.5 platformdirs==4.2.0 prompt-toolkit==3.0.43 psutil==5.9.8 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==15.0.0 pyarrow-hotfix==0.6 pygments==2.17.2 python-dateutil==2.8.2 pytz==2024.1 PyYAML==6.0.1 pyzmq==25.1.2 regex==2023.12.25 requests==2.31.0 responses==0.18.0 safetensors==0.4.2 scikit-learn==1.3.2 scipy==1.10.1 six==1.16.0 stack-data==0.6.3 sympy==1.12 threadpoolctl==3.3.0 tokenizers==0.15.2 torch==2.2.1 tornado==6.4 tqdm==4.66.2 traitlets==5.14.1 transformers==4.38.1 triton==2.2.0 typing-extensions==4.10.0 tzdata==2024.1 urllib3==2.2.1 wcwidth==0.2.13 xxhash==3.4.1 yarl==1.9.4 zipp==3.17.0
Who can help?
No response
Information
Tasks
examples folder
Reproduction
Expected behavior
The error arises only when I include the embedding layer 'wte' in the 'target_modules' argument of LoraConfig; otherwise, the model starts training normally.
I first received the error "ValueError: Attempting to unscale FP16 gradients" when trying to train 'distilgpt2' using quantization, PEFT (LoRA), and mixed-precision training.
Then I added prepare_model_for_kbit_training() after loading the model with the bnb config, and after that I received "RuntimeError: a leaf Variable that requires grad is being used in an in-place operation."
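For context, this second error is not specific to PEFT; it is what PyTorch raises whenever an in-place operation touches a leaf tensor that requires grad. A minimal, self-contained reproduction of that error class:

```python
import torch

# A leaf tensor that requires grad, like a trainable embedding weight.
weight = torch.zeros(4, requires_grad=True)

try:
    weight.mul_(2.0)  # in-place op on a grad-requiring leaf is forbidden
except RuntimeError as err:
    print(err)  # "a leaf Variable that requires grad is being used in an in-place operation."
```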
I can't find it now, but I had read somewhere that embedding layers are supported by LoRA. I did find the following: https://huggingface.co/docs/peft/en/package_reference/lora. In that link, under LoraModel, there are two examples, and I see that 'wte' is added to the 'target_modules' list in the second one.
I was unable to find any existing closed issues that solve this error of mine, hence this post.
TL;DR: when using quantization, PEFT, and fp16 training together, adding the embedding layer 'wte' to 'target_modules' produces the error "RuntimeError: a leaf Variable that requires grad is being used in an in-place operation."