hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Problem with vLLM multi-GPU inference #4893

Open ConniePK opened 3 months ago

ConniePK commented 3 months ago

System Info

absl-py 2.1.0 accelerate 0.32.0 aiofiles 23.2.1 aiohttp 3.9.1 aiosignal 1.3.1 altair 5.3.0 annotated-types 0.6.0 anyio 4.4.0 apex 0.1 argon2-cffi 23.1.0 argon2-cffi-bindings 21.2.0 asttokens 2.4.1 astunparse 1.6.3 async-timeout 4.0.3 attrs 23.2.0 audioread 3.0.1 auto_gptq 0.7.1 beautifulsoup4 4.12.3 bitsandbytes 0.43.1 bleach 6.1.0 blis 0.7.11 cachetools 5.3.2 catalogue 2.0.10 certifi 2024.2.2 cffi 1.16.0 charset-normalizer 3.3.2 click 8.1.7 cloudpathlib 0.16.0 cloudpickle 3.0.0 cmake 3.28.1 coloredlogs 15.0.1 comm 0.2.1 confection 0.1.4 contourpy 1.2.0 cubinlinker 0.3.0+2.g405ac64 cuda-python 12.3.0rc4+9.gdb8c48a.dirty cudf 23.12.0 cugraph 23.12.0 cugraph-dgl 23.12.0 cugraph-service-client 23.12.0 cugraph-service-server 23.12.0 cuml 23.12.0 cupy-cuda12x 12.3.0 cycler 0.12.1 cymem 2.0.8 Cython 3.0.8 dask 2023.11.0 dask-cuda 23.12.0 dask-cudf 23.12.0 datasets 2.20.0 debugpy 1.8.1 decorator 5.1.1 defusedxml 0.7.1 dill 0.3.8 diskcache 5.6.3 distributed 2023.11.0 distro 1.9.0 dm-tree 0.1.8 dnspython 2.6.1 docstring_parser 0.16 einops 0.7.0 email_validator 2.2.0 exceptiongroup 1.2.0 execnet 2.0.2 executing 2.0.1 expecttest 0.1.3 fastapi 0.111.0 fastapi-cli 0.0.4 fastjsonschema 2.19.1 fastrlock 0.8.2 ffmpy 0.3.2 filelock 3.13.1 fire 0.6.0 fonttools 4.48.1 frozenlist 1.4.1 fsspec 2023.12.2 gast 0.5.4 gekko 1.2.1 google-auth 2.27.0 google-auth-oauthlib 0.4.6 gradio 4.37.2 gradio_client 1.0.2 graphsurgeon 0.4.6 grpcio 1.60.1 h11 0.14.0 httpcore 1.0.5 httptools 0.6.1 httpx 0.27.0 huggingface-hub 0.23.4 humanfriendly 10.0 hypothesis 5.35.1 idna 3.6 importlib-metadata 7.0.1 importlib_resources 6.4.0 iniconfig 2.0.0 intel-openmp 2021.4.0 interegular 0.3.3 ipykernel 6.29.2 ipython 8.21.0 ipython-genutils 0.2.0 jedi 0.19.1 jieba 0.42.1 Jinja2 3.1.3 joblib 1.3.2 json5 0.9.14 jsonschema 4.21.1 jsonschema-specifications 2023.12.1 jupyter_client 8.6.0 jupyter_core 5.7.1 jupyter-tensorboard 0.2.0 jupyterlab 2.3.2 jupyterlab_pygments 0.3.0 jupyterlab-server 1.2.0 jupytext 1.16.1 kiwisolver 1.4.5 langcodes 3.3.0 lark 1.1.9 lazy_loader 0.3 librosa 0.10.1 llamafactory 0.8.3.dev0 /app/LLaMA-Factory llvmlite 0.40.1 lm-format-enforcer 0.10.3 locket 1.0.0 Markdown 3.5.2 markdown-it-py 3.0.0 MarkupSafe 2.1.4 matplotlib 3.8.2 matplotlib-inline 0.1.6 mdit-py-plugins 0.4.0 mdurl 0.1.2 mistune 3.0.2 mkl 2021.1.1 mkl-devel 2021.1.1 mkl-include 2021.1.1 mock 5.1.0 mpmath 1.3.0 msgpack 1.0.7 multidict 6.0.4 multiprocess 0.70.16 murmurhash 1.0.10 nbclient 0.9.0 nbconvert 7.16.0 nbformat 5.9.2 nest-asyncio 1.6.0 networkx 2.6.3 ninja 1.11.1.1 nltk 3.8.1 notebook 6.4.10 numba 0.57.1+1.g1ff679645 numpy 1.24.4 nvfuser 0.1.4a0+d0bb811 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-dali-cuda120 1.34.0 nvidia-ml-py 12.555.43 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.5.82 nvidia-nvtx-cu12 12.1.105 nvidia-pyindex 1.0.9 nvtx 0.2.5 oauthlib 3.2.2 onnx 1.15.0rc2 openai 1.35.10 opencv 4.7.0 optimum 1.17.0 optree 0.10.0 orjson 3.10.6 outlines 0.0.46 packaging 23.2 pandas 2.2.2 pandocfilters 1.5.1 parso 0.8.3 partd 1.4.1 peft 0.11.1 pexpect 4.9.0 pillow 10.2.0 pip 24.1.1 platformdirs 4.2.0 pluggy 1.4.0 ply 3.11 polygraphy 0.49.4 pooch 1.8.0 preshed 3.0.9 prettytable 3.9.0 prometheus-client 0.19.0 prometheus-fastapi-instrumentator 7.0.0 prompt-toolkit 3.0.43 protobuf 4.24.4 psutil 5.9.4 ptxcompiler 
0.8.1+2.g0d406d6 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyairports 2.1.1 pyarrow 16.1.0 pyarrow-hotfix 0.6 pyasn1 0.5.1 pyasn1-modules 0.3.0 pybind11 2.11.1 pybind11-global 2.11.1 pycocotools 2.0+nv0.8.0 pycountry 24.6.1 pycparser 2.21 pydantic 2.6.1 pydantic_core 2.16.2 pydub 0.25.1 Pygments 2.17.2 pylibcugraph 23.12.0 pylibcugraphops 23.12.0 pylibraft 23.12.0 pynvml 11.4.1 pyparsing 3.1.1 pytest 8.0.0 pytest-flakefinder 1.1.0 pytest-rerunfailures 13.0 pytest-shard 0.1.2 pytest-xdist 3.5.0 python-dateutil 2.8.2 python-dotenv 1.0.1 python-hostlist 1.23.0 python-multipart 0.0.9 pytorch-quantization 2.1.2 pytz 2023.3.post1 PyYAML 6.0.1 pyzmq 25.1.2 raft-dask 23.12.0 rapids-dask-dependency 23.12.1 ray 2.31.0 referencing 0.33.0 regex 2023.12.25 requests 2.32.3 requests-oauthlib 1.3.1 rich 13.7.0 rmm 23.12.0 rouge 1.0.1 rouge-chinese 1.0.3 rpds-py 0.17.1 rsa 4.9 ruff 0.5.0 safetensors 0.4.3 scikit-learn 1.2.0 scipy 1.12.0 semantic-version 2.10.0 Send2Trash 1.8.2 sentencepiece 0.2.0 setuptools 68.2.2 shellingham 1.5.4 shtab 1.7.1 six 1.16.0 smart-open 6.4.0 sniffio 1.3.1 sortedcontainers 2.4.0 soundfile 0.12.1 soupsieve 2.5 soxr 0.3.7 spacy 3.7.2 spacy-legacy 3.0.12 spacy-loggers 1.0.5 sphinx_glpi_theme 0.6 srsly 2.4.8 sse-starlette 2.1.2 stack-data 0.6.3 starlette 0.37.2 sympy 1.12 tabulate 0.9.0 tbb 2021.11.0 tblib 3.0.0 tensorboard 2.9.0 tensorboard-data-server 0.6.1 tensorboard-plugin-wit 1.8.1 tensorrt 8.6.3 termcolor 2.4.0 terminado 0.18.0 thinc 8.2.3 threadpoolctl 3.2.0 thriftpy2 0.4.17 tiktoken 0.7.0 tinycss2 1.2.1 tokenizers 0.19.1 toml 0.10.2 tomli 2.0.1 tomlkit 0.12.0 toolz 0.12.1 torch 2.3.1 torch-tensorrt 2.3.0a0 torchdata 0.7.1a0 torchtext 0.17.0a0 torchvision 0.18.1 tornado 6.4 tqdm 4.66.4 traitlets 5.9.0 transformers 4.42.4 treelite 3.9.1 treelite-runtime 3.9.1 triton 2.3.1 trl 0.9.4 typer 0.12.3 types-dataclasses 0.6.6 typing_extensions 4.9.0 tyro 0.8.5 tzdata 2024.1 ucx-py 0.35.0 uff 0.6.9 ujson 5.10.0 urllib3 2.2.2 uvicorn 0.30.1 uvloop 0.19.0 vllm 0.5.2 vllm-flash-attn 2.5.9.post1 wasabi 1.1.2 watchfiles 0.22.0 wcwidth 0.2.13 weasel 0.3.4 webencodings 0.5.1 websockets 11.0.3 Werkzeug 3.0.1 wheel 0.42.0 xdoctest 1.0.2 xformers 0.0.27 xgboost 1.7.6 xxhash 3.4.1 yarl 1.9.4 zhon 2.0.2 zict 3.0.0 zipp 3.17.0

Reproduction

Single-GPU inference works fine, but multi-GPU inference throws an error. vLLM has already been updated to the latest version.

Command:

CUDA_VISIBLE_DEVICES=3,4 API_PORT=9001 VLLM_WORKER_MULTIPROC_METHOD=spawn llamafactory-cli api vllm.yaml 

vllm.yaml is as follows:
model_name_or_path: '/app/export_output/v46_gptq'
template: qwen
vllm_maxlen: 8000
infer_backend: vllm
vllm_enforce_eager: true
vllm_gpu_util: 0.7

The model is Qwen2-72B, an int4 model quantized with GPTQ.

The error is as follows:
INFO 07-19 07:31:56 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=39407) INFO 07-19 07:31:56 utils.py:737] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=39407) INFO 07-19 07:31:56 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=39407) INFO 07-19 07:31:56 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_3,4.json
INFO 07-19 07:31:56 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_3,4.json
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: Weight input_size_per_partition = 14784 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq., Traceback (most recent call last):
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]     self.model_runner.load_model()
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 256, in load_model
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]     self.model = get_model(model_config=self.model_config,
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]     return loader.load_model(model_config=model_config,
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 267, in load_model
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]     model = _initialize_model(model_config, self.load_config,
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 104, in _initialize_model
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]     return model_class(config=model_config.hf_config,
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 316, in __init__
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]     self.model = Qwen2Model(config, cache_config, quant_config)
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 240, in __init__
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]     self.layers = nn.ModuleList([
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 241, in <listcomp>
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]     Qwen2DecoderLayer(config, cache_config, quant_config)
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 183, in __init__
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]     self.mlp = Qwen2MLP(
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 68, in __init__
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]     self.down_proj = RowParallelLinear(intermediate_size,
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 693, in __init__
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]     self.quant_method.create_weights(
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 157, in create_weights
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]     verify_marlin_supports_shape(
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 68, in verify_marlin_supports_shape
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226]     raise ValueError(f"Weight input_size_per_partition = "
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226] ValueError: Weight input_size_per_partition = 14784 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.
(VllmWorkerProcess pid=39407) ERROR 07-19 07:31:57 multiproc_worker_utils.py:226] 
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/bin/llamafactory-cli", line 8, in <module>
[rank0]:     sys.exit(main())
[rank0]:   File "/app/LLaMA-Factory/src/llamafactory/cli.py", line 79, in main
[rank0]:     run_api()
[rank0]:   File "/app/LLaMA-Factory/src/llamafactory/api/app.py", line 117, in run_api
[rank0]:     chat_model = ChatModel()
[rank0]:   File "/app/LLaMA-Factory/src/llamafactory/chat/chat_model.py", line 45, in __init__
[rank0]:     self.engine: "BaseEngine" = VllmEngine(model_args, data_args, finetuning_args, generating_args)
[rank0]:   File "/app/LLaMA-Factory/src/llamafactory/chat/vllm_engine.py", line 102, in __init__
[rank0]:     self.model = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**engine_args))
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 444, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 373, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 520, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 249, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 158, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 150, in __init__
[rank0]:     super().__init__(model_config, cache_config, parallel_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 46, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 84, in _init_executor
[rank0]:     self._run_workers("load_model",
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 135, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 256, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 267, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 104, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 316, in __init__
[rank0]:     self.model = Qwen2Model(config, cache_config, quant_config)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 240, in __init__
[rank0]:     self.layers = nn.ModuleList([
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 241, in <listcomp>
[rank0]:     Qwen2DecoderLayer(config, cache_config, quant_config)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 183, in __init__
[rank0]:     self.mlp = Qwen2MLP(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 68, in __init__
[rank0]:     self.down_proj = RowParallelLinear(intermediate_size,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 693, in __init__
[rank0]:     self.quant_method.create_weights(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 157, in create_weights
[rank0]:     verify_marlin_supports_shape(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 68, in verify_marlin_supports_shape
[rank0]:     raise ValueError(f"Weight input_size_per_partition = "
[rank0]: ValueError: Weight input_size_per_partition = 14784 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
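
For context, the failing shape check can be reproduced with a few lines of arithmetic. Qwen2-72B's intermediate_size is 29568 (consistent with the 14784-per-partition figure in the error), so splitting down_proj across two GPUs leaves each partition with an input size that is not a multiple of the GPTQ-Marlin kernel's min_thread_k of 128. A minimal sketch:

# Sketch of the GPTQ-Marlin shape check that fails above. The 29568
# figure is Qwen2-72B's intermediate_size, inferred from the error
# message (14784 per partition x 2 GPUs).
intermediate_size = 29568
tensor_parallel_size = 2   # CUDA_VISIBLE_DEVICES=3,4
min_thread_k = 128         # GPTQ-Marlin kernel requirement

input_size_per_partition = intermediate_size // tensor_parallel_size
print(input_size_per_partition)                 # 14784, as in the error
print(input_size_per_partition % min_thread_k)  # 64 -> not divisible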

Expected behavior

No response

Others

No response

WeiminLee commented 3 months ago

I ran into this problem too; no solution yet.

xliu1991 commented 3 months ago

Has this been solved?

chocoHunter commented 3 months ago

Same problem.

Muieay commented 3 months ago

The free version of Unsloth only supports 2x GPU parallelism; more GPUs require applying for the commercial version. #4105

wyclike commented 3 months ago

Same problem here.

ConniePK commented 2 months ago

The free version of Unsloth only supports 2x GPU parallelism; more GPUs require applying for the commercial version. #4105

But I didn't enable Unsloth.

wyclike commented 2 months ago

Try uninstalling flash-attention.

ConniePK commented 2 months ago

VLLM_WORKER_MULTIPROC_METHOD=spawn

Indeed, after uninstalling it, it works.
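
The ValueError above also suggests an alternative workaround: run with --quantization gptq so vLLM uses the plain GPTQ kernel rather than GPTQ-Marlin, avoiding the min_thread_k shape check. A minimal, untested sketch calling vLLM directly with the settings from this thread:

# Hedged sketch (not verified on this setup): force the plain GPTQ
# kernel instead of GPTQ-Marlin, per the error message's own suggestion.
from vllm import LLM

llm = LLM(
    model="/app/export_output/v46_gptq",  # model path from this thread
    quantization="gptq",                  # avoid the gptq_marlin shape check
    tensor_parallel_size=2,
    enforce_eager=True,
    gpu_memory_utilization=0.7,
)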

jasinliu commented 2 weeks ago

VLLM_WORKER_MULTIPROC_METHOD=spawn

Indeed, after uninstalling it, it works.

What does this mean? Is flash-attention acceleration no longer supported?