hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Error when fine-tuning Qwen-72B with DoRA #4396

Closed ConniePK closed 2 months ago

ConniePK commented 2 months ago

Reminder

System Info

absl-py 1.4.0 accelerate 0.30.1 addict 2.4.0 aiofiles 23.1.0 aiohttp 3.8.4 aiosignal 1.3.1 aliyun-python-sdk-core 2.14.0 aliyun-python-sdk-kms 2.16.2 altair 5.0.1 annotated-types 0.6.0 anyio 3.7.0 apex 0.1 appdirs 1.4.4 argcomplete 1.12.3 argilla 0.0.1 argon2-cffi 21.3.0 argon2-cffi-bindings 21.2.0 asttokens 2.2.1 astunparse 1.6.3 async-timeout 4.0.2 attrs 22.2.0 audioread 3.0.0 auto-gptq 0.6.0 backcall 0.2.0 backports.zoneinfo 0.2.1 beautifulsoup4 4.11.2 bitsandbytes 0.43.0 bleach 6.0.0 blinker 1.6.2 blis 0.7.9 Brotli 1.0.9 cachetools 5.2.0 catalogue 2.0.8 certifi 2022.12.7 cffi 1.15.1 charset-normalizer 3.0.1 click 8.1.3 cloudpickle 2.2.0 cmake 3.24.1.1 colorama 0.4.6 coloredlogs 15.0.1 comm 0.1.2 confection 0.0.4 contourpy 1.0.6 cpm-kernels 1.0.11 crcmod 1.7 cryptography 41.0.1 cubinlinker 0.2.2+2.g8e13447 cuda-python 12.1.0rc1+1.g9e30ea2.dirty cudf 22.12.0 cugraph 22.12.0 cugraph-dgl 22.12.0 cugraph-service-client 22.12.0 cugraph-service-server 22.12.0 cuml 22.12.0 cupy-cuda12x 12.1.0 cycler 0.11.0 cymem 2.0.7 Cython 0.29.33 dask 2022.11.1 dask-cuda 22.12.0 dask-cudf 22.12.0 dataclasses-json 0.5.7 datasets 2.16.0 debugpy 1.6.6 decorator 5.1.1 defusedxml 0.7.1 dill 0.3.7 diskcache 5.6.3 distributed 2022.11.1 docker-pycreds 0.4.0 docstring-parser 0.15 duckduckgo-search 3.8.3 einops 0.7.0 et-xmlfile 1.1.0 exceptiongroup 1.1.0 execnet 1.9.0 executing 1.2.0 expecttest 0.1.3 faiss-cpu 1.7.4 fastapi 0.110.0 fastjsonschema 2.16.2 fastllm-pytools 0.0.1 fastrlock 0.8.1 ffmpy 0.3.0 filelock 3.9.0 fire 0.5.0 FlagEmbedding 1.1.0 fonttools 4.38.0 frozenlist 1.3.3 fschat 0.2.3 fsspec 2023.6.0 galaxy-fds-sdk 1.4.39 gast 0.4.0 gekko 1.0.6 gitdb 4.0.10 GitPython 3.1.31 google-auth 2.16.0 google-auth-oauthlib 0.4.6 gradio 4.21.0 gradio_client 0.12.0 graphsurgeon 0.4.6 greenlet 2.0.2 grpcio 1.51.1 h11 0.14.0 h2 4.1.0 HeapDict 1.0.1 hpack 4.0.0 httpcore 0.17.2 httptools 0.6.1 httpx 0.24.1 huggingface-hub 0.23.2 humanfriendly 10.0 hyperframe 6.0.1 hypothesis 5.35.1 idna 3.4 importlib-metadata 7.0.1 importlib-resources 5.10.2 iniconfig 2.0.0 intel-openmp 2021.4.0 interegular 0.3.3 ipykernel 6.21.1 ipython 8.9.0 ipython-genutils 0.2.0 jedi 0.18.2 jieba 0.42.1 Jinja2 3.1.2 jmespath 0.10.0 joblib 1.2.0 json5 0.9.11 jsonschema 4.17.3 jupyter_client 8.0.2 jupyter_core 5.2.0 jupyter-tensorboard 0.2.0 jupyterlab 2.3.2 jupyterlab-pygments 0.2.2 jupyterlab-server 1.2.0 jupytext 1.14.4 kiwisolver 1.4.4 langchain 0.0.176 langcodes 3.3.0 lark 1.1.9 librosa 0.9.2 linkify-it-py 1.0.3 llvmlite 0.39.1 locket 1.0.0 lxml 4.9.2 Markdown 3.4.1 markdown-it-py 2.1.0 markdown2 2.4.8 MarkupSafe 2.1.1 marshmallow 3.19.0 marshmallow-enum 1.5.1 matplotlib 3.6.2 matplotlib-inline 0.1.6 mdit-py-plugins 0.3.3 mdurl 0.1.2 mistune 2.0.5 mkl 2021.1.1 mkl-devel 2021.1.1 mkl-include 2021.1.1 mock 5.0.1 modelscope 1.12.0 mpmath 1.2.1 msg-parser 1.2.0 msgpack 1.0.4 multidict 6.0.4 multiprocess 0.70.15 murmurhash 1.0.9 mypy-extensions 1.0.0 nbclient 0.7.2 nbconvert 7.2.9 nbformat 5.7.3 nest-asyncio 1.5.6 networkx 2.6.3 ninja 1.11.1.1 nltk 3.8.1 notebook 6.4.10 numba 0.56.4+1.g772622d0d numexpr 2.8.4 numpy 1.22.2 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-dali-cuda110 1.22.0 nvidia-nccl-cu12 2.18.1 nvidia-nvjitlink-cu12 12.3.101 nvidia-nvtx-cu12 12.1.105 nvidia-pyindex 1.0.9 nvtx 0.2.5 oauthlib 3.2.2 olefile 
0.46 onnx 1.13.0 openapi-schema-pydantic 1.2.4 opencv 4.6.0 openpyxl 3.1.2 optimum 1.16.2 orjson 3.9.1 oss2 2.18.4 outlines 0.0.34 packaging 22.0 pandas 1.5.2 pandocfilters 1.5.0 parso 0.8.3 partd 1.3.0 pathtools 0.1.2 pathy 0.10.1 pdfminer.six 20221105 peft 0.11.1 pexpect 4.8.0 pickleshare 0.7.5 Pillow 9.2.0 pip 24.0 pkgutil_resolve_name 1.3.10 platformdirs 4.2.0 pluggy 1.0.0 ply 3.11 polygraphy 0.43.1 pooch 1.6.0 preshed 3.0.8 prettytable 3.6.0 prometheus_client 0.20.0 prompt-toolkit 3.0.36 protobuf 3.20.3 psutil 5.9.4 ptxcompiler 0.7.0+27.gd73915e ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyarrow 15.0.0 pyarrow-hotfix 0.6 pyasn1 0.4.8 pyasn1-modules 0.2.8 pybind11 2.10.3 pycocotools 2.0+nv0.7.1 pycparser 2.21 pycryptodome 3.20.0 pydantic 2.6.4 pydantic_core 2.16.3 pydeck 0.8.0 pydub 0.25.1 Pygments 2.14.0 pylibcugraph 22.12.0 pylibcugraphops 22.12.0 pylibraft 22.12.0 Pympler 1.0.1 pynvml 11.5.0 pypandoc 1.11 pyparsing 3.0.9 pypdf 3.9.1 pyrsistent 0.19.3 pysqlite3 0.5.0 pytest 7.2.1 pytest-rerunfailures 11.0 pytest-shard 0.1.2 pytest-xdist 3.2.0 python-dateutil 2.8.2 python-docx 0.8.11 python-dotenv 1.0.1 python-hostlist 1.23.0 python-magic 0.4.27 python-multipart 0.0.9 python-pptx 0.6.21 pytorch-quantization 2.1.2 pytz 2022.6 pytz-deprecation-shim 0.1.0.post0 PyYAML 6.0 pyzmq 25.0.0 raft-dask 22.12.0 ray 2.9.3 referencing 0.34.0 regex 2022.10.31 requests 2.28.2 requests-oauthlib 1.3.1 resampy 0.4.2 rich 13.3.1 rmm 22.12.0 rouge 1.0.1 rouge-chinese 1.0.3 rpds-py 0.18.0 rsa 4.9 ruff 0.3.7 safetensors 0.4.2 scikit-learn 0.24.2 scipy 1.6.3 seaborn 0.12.1 semantic-version 2.10.0 Send2Trash 1.8.0 sentence-transformers 2.2.2 sentencepiece 0.1.99 sentry-sdk 1.25.1 setproctitle 1.3.2 setuptools 65.5.1 shellingham 1.5.4 shortuuid 1.0.11 shtab 1.7.0 simplejson 3.19.2 six 1.16.0 smart-open 6.3.0 smmap 5.0.0 sniffio 1.3.0 socksio 1.0.0 some-package 0.1 sortedcontainers 2.4.0 soundfile 0.11.0 soupsieve 2.3.2.post1 spacy 3.5.0 spacy-legacy 3.0.12 spacy-loggers 1.0.4 sphinx-glpi-theme 0.3 SQLAlchemy 2.0.16 srsly 2.4.5 sse-starlette 2.0.0 stack-data 0.6.2 starlette 0.36.3 streamlit 1.26.0 strings-udf 22.12.0 svgwrite 1.4.3 sympy 1.11.1 tbb 2021.8.0 tblib 1.7.0 tenacity 8.2.2 tensorboard 2.9.0 tensorboard-data-server 0.6.1 tensorboard-plugin-wit 1.8.1 tensorrt 8.5.3.1 termcolor 2.4.0 terminado 0.17.1 thinc 8.1.7 threadpoolctl 3.1.0 thriftpy2 0.4.16 tiktoken 0.6.0 tinycss2 1.2.1 tokenizers 0.19.1 toml 0.10.2 tomli 2.0.1 tomlkit 0.12.0 toolz 0.12.0 torch 2.1.2 torch-tensorrt 1.4.0.dev0 torchtext 0.13.0a0+fae8e8c torchvision 0.15.0a0 tornado 6.1 tqdm 4.64.1 traitlets 5.9.0 transformer-engine 0.5.0 transformers 4.41.2 transformers-stream-generator 0.0.4 treelite 3.0.1 treelite-runtime 3.0.1 triton 2.1.0 trl 0.8.6 typer 0.9.0 typing_extensions 4.10.0 typing-inspect 0.9.0 tyro 0.7.3 tzdata 2023.3 tzlocal 4.3.1 uc-micro-py 1.0.2 ucx-py 22.12.0 uff 0.6.9 unstructured 0.6.5 urllib3 1.26.13 uvicorn 0.22.0 uvloop 0.19.0 validators 0.22.0 vllm 0.4.0 wasabi 1.1.1 watchdog 3.0.0 watchfiles 0.21.0 wavedrom 2.0.3.post3 wcwidth 0.2.6 webencodings 0.5.1 websockets 11.0.3 Werkzeug 2.2.2 wheel 0.38.4 xdoctest 1.0.2 xgboost 1.7.1 XlsxWriter 3.1.2 xxhash 3.3.0 yapf 0.40.2 yarl 1.9.2 zhon 1.1.5 zict 2.2.0 zipp 3.11.0

Reproduction

CUDA_VISIBLE_DEVICES=0 nohup python3 src/train.py  \
    --stage sft \
    --do_train \
    --model_name_or_path '/root/.cache/modelscope/hub/qwen/Qwen-72B-Chat/' \
    --dataset ai_service_data \
    --template qwen \
    --finetuning_type lora \
    --lora_target all \
    --output_dir $model_output \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --learning_rate 5e-6 \
    --num_train_epochs 5.0 \
    --plot_loss \
    --quantization_bit 4 \
    --use_dora true \
    --fp16 > ft_logs/ai_service_v37.out 2>&1 &
The run produces the following output:

[INFO|trainer.py:641] 2024-06-20 08:43:38,771 >> Using auto half precision backend
[INFO|trainer.py:2078] 2024-06-20 08:43:39,155 >> ***** Running training *****
[INFO|trainer.py:2079] 2024-06-20 08:43:39,155 >>   Num examples = 3,095
[INFO|trainer.py:2080] 2024-06-20 08:43:39,155 >>   Num Epochs = 5
[INFO|trainer.py:2081] 2024-06-20 08:43:39,155 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:2084] 2024-06-20 08:43:39,155 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2085] 2024-06-20 08:43:39,155 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:2086] 2024-06-20 08:43:39,155 >>   Total optimization steps = 965
[INFO|trainer.py:2087] 2024-06-20 08:43:39,166 >>   Number of trainable parameters = 101,580,800
  0%|          | 0/965 [00:00<?, ?it/s]
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Traceback (most recent call last):
  File "src/train.py", line 14, in <module>
    main()
  File "src/train.py", line 5, in main
    run_exp()
  File "/home/work/bin/ChatGLM-For-Rerank/llama-factory/src/llamafactory/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/work/bin/ChatGLM-For-Rerank/llama-factory/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 3238, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 3264, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.8/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/peft/peft_model.py", line 1430, in forward
    return self.base_model(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/peft/tuners/tuners_utils.py", line 179, in forward
    return self.model.forward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 1045, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 882, in forward
    outputs = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.8/dist-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 451, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 230, in forward
    outputs = run_function(*args)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 878, in custom_forward
    return module(*inputs, use_cache, output_attentions)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 612, in forward
    attn_outputs = self.attn(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 527, in forward
    attention_mask = attention_mask.masked_fill(~causal_mask, torch.finfo(query.dtype).min)
RuntimeError: value cannot be converted to type at::Half without overflow

Expected behavior

No response

Others

No response

hiyouga commented 2 months ago

Change fp16 to bf16.

ConniePK commented 2 months ago

Change fp16 to bf16.

After switching to bf16 it still fails, now with: RuntimeError: value cannot be converted to type at::BFloat16 without overflow
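A possible explanation for why bf16 fails as well (an assumption, not confirmed in this thread): the fill value comes from torch.finfo(query.dtype).min, and if query is float32 at that point, its minimum is slightly larger in magnitude than anything bfloat16 can represent. A quick check in PyTorch:

import torch

# Assumption: query.dtype is float32 here, so the fill value is float32's minimum.
print(torch.finfo(torch.float32).min)   # -3.4028234663852886e+38
print(torch.finfo(torch.bfloat16).min)  # -3.3895313892515355e+38 (smaller in magnitude)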

hiyouga commented 2 months ago

This model may not support DoRA.

ConniePK commented 2 months ago

This model may not support DoRA.

I found the answer below online and tried it; it works, but I don't know the underlying reason.

I changed line 525 of modeling_qwen.py from attention_mask.masked_fill(~causal_mask, torch.finfo(query.dtype).min) to attention_mask.masked_fill(~causal_mask, -1e4), and training then works. torch.finfo(query.dtype).min is the most negative value representable by that dtype; this value is apparently too extreme and triggers the overflow error, while a less extreme value such as -1e4 avoids it. Using -1e5 still raises the same error.
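A minimal sketch of what appears to be going on (not the actual Qwen forward pass; the tensor shapes and the float32 query dtype are assumptions): masked_fill on a half-precision attention mask with a scalar outside the fp16 range raises exactly this overflow error, while a less extreme fill value such as -1e4 goes through.

import torch

attention_mask = torch.zeros(1, 1, 4, 4, dtype=torch.float16)       # assumed fp16 mask
causal_mask = torch.tril(torch.ones(4, 4)).bool().view(1, 1, 4, 4)  # lower-triangular causal mask
fill = torch.finfo(torch.float32).min                               # assumed: query is float32

try:
    attention_mask.masked_fill(~causal_mask, fill)                  # -3.4e38 does not fit in fp16
except RuntimeError as e:
    print(e)  # value cannot be converted to type at::Half without overflow

patched = attention_mask.masked_fill(~causal_mask, -1e4)            # the workaround described above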

After some testing, -65504.0 is the most negative value that avoids the error; anything smaller, e.g. -65505.0, raises it again. -65504.0 is exactly the minimum value of the half-precision float16 type.
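That boundary can be checked directly against PyTorch's dtype limits (a sketch; nothing LLaMA-Factory-specific):

import torch

print(torch.finfo(torch.float16).min)  # -65504.0, the most negative finite fp16 value

mask = torch.zeros(4, dtype=torch.float16)
sel = torch.tensor([True, False, True, False])
mask.masked_fill(sel, -65504.0)        # fine: exactly representable in fp16
# mask.masked_fill(sel, -65505.0)      # raises: value cannot be converted to type at::Half without overflow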