Closed goodmaney closed 5 days ago
INSTALL_FLASHATTN=true
With INSTALL_FLASHATTN=true, a newer flash-attn build is installed and it errors out. Following https://github.com/Dao-AILab/flash-attention/issues/966#issuecomment-2150771661 and installing torch==2.3.0 with flash-attn==2.5.8 fixed `undefined symbol: _ZN3c104cuda14ExchangeDeviceEa`. Is flash-attn mandatory for the 4090? Training with the command line above produces the error described in https://github.com/hiyouga/LLaMA-Factory/issues/4441#issue-2369673699. Which parameter switches to SDPA attention?
This is likely a problem in the GLM model code. Try updating this file: https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/modeling_chatglm.py#L30-L36
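On the SDPA question: in LLaMA-Factory the attention backend is chosen with the `flash_attn` option. A sketch, assuming a recent version where the option accepts auto / disabled / sdpa / fa2 (check your version's argument list):

```yaml
### add to the training yaml (or pass --flash_attn sdpa on the CLI)
flash_attn: sdpa   # use PyTorch scaled_dot_product_attention instead of flash-attn
```

With `disabled` or `sdpa`, the flash-attn package itself should not be required, though remote model code (such as modeling_chatglm.py) may still try to import it.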
INSTALL_FLASHATTN=true
Tried many times. I found that inside Docker it only runs with torch==2.1.2 and `pip install flash-attn --no-build-isolation`; after installing that, torchtext and torchvision both had to be switched to 0.16.2. The torch==2.3.0 / flash-attn==2.5.8 combination mentioned above did not work either. I don't know how it succeeded the first time; maybe it depends on the CUDA version inside Docker? Later I tried docker compose and could not get it running no matter what.
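For reproducibility, the working combination described above can be pinned in the image instead of relying on INSTALL_FLASHATTN. A sketch, assuming a CUDA 12.x base image with Python 3.10 (the versions are the ones reported in this thread, not a general recommendation):

```dockerfile
# Pin the torch stack first so flash-attn compiles against this exact ABI
RUN pip install "torch==2.1.2" "torchvision==0.16.2" "torchtext==0.16.2"
# --no-build-isolation makes the build see the torch installed above,
# instead of a temporary build-time torch with a different ABI
RUN pip install flash-attn --no-build-isolation
```

Building from source this way avoids the prebuilt-wheel ABI mismatch, at the cost of a long compile.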
Can flash-attn be avoided entirely? In an environment built with `pip install -e .`, installing flash-attn just hangs forever, so Docker is my only option.
Fixed in e3141f5f1b435d12c71d8b1fc6ade6e69deead71
`llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml` works now. But running with command-line arguments, i.e. `llamafactory-cli train --stage sft --do_train True` (which is what the WebUI does), still reports that flash_attn is not installed. Trying to install flash_attn inside Docker fails with an error. The docker compose setup pulled on the 26th runs fine on another machine with dual 4090s; the machine that errors has a single 4090.
exit code: 1
╰─> [165 lines of output]
fatal: not a git repository (or any of the parent directories): .git
torch.__version__ = 2.3.0a0+ebedce2
/usr/local/lib/python3.10/dist-packages/setuptools/__init__.py:80: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
!!
********************************************************************************
Requirements should be satisfied by a PEP 517 installer.
If you are using pip, you can try `pip install --use-pep517`.
********************************************************************************
!!
dist.fetch_build_eggs(dist.setup_requires)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2095, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '4']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-x_jhgpxb/flash-attn_2f2e7ee88bc743f1bc99623ecc04d0cc/setup.py", line 311, in <module>
setup(
File "/usr/local/lib/python3.10/dist-packages/setuptools/__init__.py", line 103, in setup
return distutils.core.setup(**attrs)
File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 969, in run_commands
self.run_command(cmd)
File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
super().run_command(command)
File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/tmp/pip-install-x_jhgpxb/flash-attn_2f2e7ee88bc743f1bc99623ecc04d0cc/setup.py", line 266, in run
return super().run()
File "/usr/local/lib/python3.10/dist-packages/wheel/bdist_wheel.py", line 368, in run
self.run_command("build")
File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
super().run_command(command)
File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build.py", line 131, in run
self.run_command(cmd_name)
File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
super().run_command(command)
File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 88, in run
_build_ext.run(self)
File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
self.build_extensions()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 870, in build_extensions
build_ext.build_extensions(self)
File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
self._build_extensions_serial()
File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
self.build_extension(ext)
File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 249, in build_extension
_build_ext.build_extension(self, ext)
File "/usr/local/lib/python3.10/dist-packages/Cython/Distutils/build_ext.py", line 135, in build_extension
super(build_ext, self).build_extension(ext)
File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
objects = self.compiler.compile(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 683, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1773, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2111, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash-attn
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (flash-attn)
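When the source build fails like this, a prebuilt wheel from the flash-attention release page is an alternative, but its filename must match the local torch version, CUDA version, C++11 ABI flag, and Python version exactly, or the import later fails with precisely the undefined-symbol error discussed above. A hypothetical helper for sanity-checking a candidate wheel name (the filename pattern below is the one used by the project's release assets; verify it against the actual release page):

```python
import re

# flash-attn release wheels encode their build matrix in the filename, e.g.
# flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
WHEEL_RE = re.compile(
    r"flash_attn-(?P<ver>[\d.]+)\+cu(?P<cuda>\d+)"
    r"torch(?P<torch>[\d.]+)cxx11abi(?P<abi>TRUE|FALSE)"
    r"-cp(?P<py>\d+)-"
)

def wheel_matches(wheel_name: str, torch_ver: str, cuda_ver: str,
                  cxx11_abi: bool, py_tag: str) -> bool:
    """Return True if the wheel was built for this exact environment.

    torch_ver like "2.3", cuda_ver like "122" (CUDA 12.2),
    py_tag like "310" (CPython 3.10).
    """
    m = WHEEL_RE.match(wheel_name)
    if m is None:
        return False
    return (m["torch"] == torch_ver
            and m["cuda"] == cuda_ver
            and (m["abi"] == "TRUE") == cxx11_abi
            and m["py"] == py_tag)

name = "flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
print(wheel_matches(name, "2.3", "122", False, "310"))  # True: every field lines up
print(wheel_matches(name, "2.1", "122", False, "310"))  # False: torch version differs
```

The local values to compare against come from `torch.__version__`, `torch.version.cuda`, `torch._C._GLIBCXX_USE_CXX11_ABI`, and the Python interpreter version.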
Reminder
System Info
OS: WSL2, CUDA 12.3, latest LLaMA-Factory, docker compose
Reproduction
Command line:

llamafactory-cli train \
  --stage sft \
  --do_train True \
  --model_name_or_path /home/xx/.cache/modelscope/hub/ZhipuAI/glm-4-9b-chat/ \
  --preprocessing_num_workers 16 \
  --finetuning_type lora \
  --template glm4 \
  --dataset_dir data \
  --dataset test \
  --cutoff_len 1024 \
  --learning_rate 5e-05 \
  --num_train_epochs 3.0 \
  --max_samples 100000 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --lr_scheduler_type cosine \
  --max_grad_norm 1.0 \
  --logging_steps 5 \
  --save_steps 50 \
  --warmup_steps 0 \
  --optim adamw_torch \
  --packing False \
  --report_to none \
  --output_dir saves/GLM-4-9B-Chat/lora/train_2024-06-27-13-02-26 \
  --fp16 True \
  --plot_loss True \
  --ddp_timeout 180000000 \
  --include_num_input_tokens_seen True \
  --lora_rank 8 \
  --lora_alpha 16 \
  --lora_dropout 0 \
  --lora_target all

The error occurs with or without --flash_attn auto on the command line, and also with llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml using the following .yaml:
### model
model_name_or_path: modles/ZhipuAI/glm-4-9b-chat/

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: test
template: glm4
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/GLM-4-9B-Chat/lora/train_2024-06-27-13-02-26
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
All of the above fail with the error below:
##################################
Traceback (most recent call last):
File "/usr/local/bin/llamafactory-cli", line 8, in <module>
sys.exit(main())
File "/app/src/llamafactory/cli.py", line 111, in main
run_exp()
File "/app/src/llamafactory/train/tuner.py", line 50, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/app/src/llamafactory/train/sft/workflow.py", line 49, in run_sft
model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
File "/app/src/llamafactory/model/loader.py", line 152, in load_model
model = AutoModelForCausalLM.from_pretrained(**init_kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 550, in from_pretrained
model_class = get_class_from_dynamic_module(
File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 501, in get_class_from_dynamic_module
final_module = get_cached_module_file(
File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 326, in get_cached_module_file
modules_needed = check_imports(resolved_module_file)
File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 181, in check_imports
raise ImportError(
ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run
pip install flash_attn
#####################################################################
After installing flash_attn, the error becomes:

Traceback (most recent call last):
File "/usr/local/bin/llamafactory-cli", line 5, in <module>
from llamafactory.cli import main
File "/app/src/llamafactory/__init__.py", line 17, in <module>
from .cli import VERSION
File "/app/src/llamafactory/cli.py", line 21, in <module>
from . import launcher
File "/app/src/llamafactory/launcher.py", line 15, in <module>
from llamafactory.train.tuner import run_exp
File "/app/src/llamafactory/train/tuner.py", line 27, in <module>
from ..model import load_model, load_tokenizer
File "/app/src/llamafactory/model/__init__.py", line 15, in <module>
from .loader import load_config, load_model, load_tokenizer
File "/app/src/llamafactory/model/loader.py", line 28, in <module>
from .patcher import patch_config, patch_model, patch_tokenizer, patch_valuehead_model
File "/app/src/llamafactory/model/patcher.py", line 30, in <module>
from .model_utils.longlora import configure_longlora
File "/app/src/llamafactory/model/model_utils/longlora.py", line 25, in <module>
from transformers.models.llama.modeling_llama import (
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 54, in <module>
from flash_attn import flash_attn_func, flash_attn_varlen_func
File "/usr/local/lib/python3.10/dist-packages/flash_attn/__init__.py", line 3, in <module>
from flash_attn.flash_attn_interface import (
File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 10, in <module>
import flash_attn_2_cuda as flash_attn_cuda
ImportError: /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda14ExchangeDeviceEa
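The undefined symbol here (`_ZN3c104cuda14ExchangeDeviceEa` demangles to `c10::cuda::ExchangeDevice(signed char)`, a torch-internal API) means the flash_attn binary was compiled against a different torch than the one installed, so the package is "installed" yet not importable. A small probe (hypothetical helper name) distinguishes the two failure modes:

```python
import importlib
import importlib.util

def flash_attn_importable() -> bool:
    """True only when flash_attn is installed AND its CUDA extension loads.

    find_spec alone is not enough: a wheel built against a different
    torch ABI is found on disk but raises ImportError (undefined symbol)
    the moment its .so is loaded.
    """
    if importlib.util.find_spec("flash_attn") is None:
        return False  # not installed at all
    try:
        importlib.import_module("flash_attn")
        return True
    except ImportError:
        return False  # installed, but ABI-incompatible with this torch

print(flash_attn_importable())
```

With a guard like this, code can fall back to SDPA attention instead of crashing at import time.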
Expected behavior
No response
Others
No response