vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Qwen 14B AWQ deploy: AttributeError: 'ndarray' object has no attribute '_torch_dtype' #3033

Open testTech92 opened 8 months ago

testTech92 commented 8 months ago

$ python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8001 --model Qwen1.5-14B-Chat-AWQ --tensor-parallel-size 2 --quantization awq --trust-remote-code --dtype half

INFO 02-26 10:32:53 api_server.py:229] args: Namespace(host='0.0.0.0', port=8061, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='Qwen1.5-14B-Chat-AWQ', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='awq', enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='cuda', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 02-26 10:32:53 config.py:186] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-26 10:32:53 config.py:413] Custom all-reduce kernels are temporarily disabled due to stability issues. We will re-enable them once the issues are resolved.
2024-02-26 10:32:56,211 INFO worker.py:1724 -- Started a local Ray instance.
INFO 02-26 10:32:57 llm_engine.py:79] Initializing an LLM engine with config: model='Qwen1.5-14B-Chat-AWQ', tokenizer='Qwen1.5-14B-Chat-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 237, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 625, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 321, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 366, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 118, in __init__
    self._init_workers_ray(placement_group)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 286, in _init_workers_ray
    self._run_workers("init_model", cupy_port=get_open_port())
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1014, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 94, in init_model
    init_distributed_environment(self.parallel_config, self.rank,
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 285, in init_distributed_environment
    cupy_utils.all_reduce(torch.zeros(1).cuda())
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/cupy_utils.py", line 110, in all_reduce
    cupy_input._torch_dtype = torch_dtype  # pylint: disable=protected-access
AttributeError: 'ndarray' object has no attribute '_torch_dtype'
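As a quick way to see which cupy wheel is installed and which CUDA toolkit the environment actually provides (a generic check, nothing vLLM-specific):

pip list | grep -i cupy
python -c "import cupy; print(cupy.__version__); cupy.show_config()"
nvcc --version   # or nvidia-smi, to see which CUDA version the driver supports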

yuezhang030 commented 8 months ago

I have exactly the same issue:

cupy_input._torch_dtype = torch_dtype # pylint: disable=protected-access AttributeError: 'ndarray' object has no attribute '_torch_dtype'

janelu9 commented 8 months ago

That's a problem with cupy. Try uninstalling cupy*, then pip install cupy-cuda11x==12.1.0 if you are using CUDA 11.2 ~ 11.x.
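Spelled out, that suggestion is roughly the following (assuming a CUDA 11.x toolkit):

pip uninstall -y cupy-cuda12x cupy-cuda11x
pip install cupy-cuda11x==12.1.0
# if vLLM then complains that NCCLBackend is not available, also install cupy's NCCL library:
python -m cupyx.tools.install_library --library nccl --cuda 11.x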

piulin commented 8 months ago

Same issue here. When downgrading to cupy-cuda12x==12.1.0, I get:

ImportError: NCCLBackend is not available. Please install cupy.

kuri-leo commented 8 months ago

> Same issue here. When downgrading to cupy-cuda12x==12.1.0, I get:
>
> ImportError: NCCLBackend is not available. Please install cupy.

Same issue here.

vllm-0.3.2+cu118-cp310-cp310-manylinux1_x86_64.whl accidentally pulled in cupy-cuda12x==12.1.0 during installation, even though the environment is CUDA 11.x (installed with conda).

Fixed by pip install cupy-cuda11x==12.1 and python -m cupyx.tools.install_library --library nccl --cuda 11.x.

Frustratingly, simply running pip install cupy-cuda11x==12.1 was not enough on its own; I had to uninstall cupy first and reinstall it, and then it worked.
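To confirm that the NCCL plugin actually landed after the reinstall, a one-liner like this should print an NCCL version instead of raising ImportError (a generic cupy check, not vLLM-specific):

python -c "from cupy.cuda import nccl; print(nccl.get_version())"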

testTech92 commented 8 months ago

> That's a problem with cupy. Try uninstalling cupy*, then pip install cupy-cuda11x==12.1.0 if you are using CUDA 11.2 ~ 11.x.

Great, it worked for me: pip install cupy-cuda11x==12.1.0 if your CUDA version is 11.x.

piulin commented 8 months ago

> > Same issue here. When downgrading to cupy-cuda12x==12.1.0, I get:
> >
> > ImportError: NCCLBackend is not available. Please install cupy.
>
> Same issue here.
>
> vllm-0.3.2+cu118-cp310-cp310-manylinux1_x86_64.whl accidentally pulled in cupy-cuda12x==12.1.0 during installation, even though the environment is CUDA 11.x (installed with conda).
>
> Fixed by pip install cupy-cuda11x==12.1 and python -m cupyx.tools.install_library --library nccl --cuda 11.x.
>
> Frustratingly, simply running pip install cupy-cuda11x==12.1 was not enough on its own; I had to uninstall cupy first and reinstall it, and then it worked.

This worked for me as well.

enze5088 commented 8 months ago

> > Same issue here. When downgrading to cupy-cuda12x==12.1.0, I get:
> >
> > ImportError: NCCLBackend is not available. Please install cupy.
>
> Same issue here.
>
> vllm-0.3.2+cu118-cp310-cp310-manylinux1_x86_64.whl accidentally pulled in cupy-cuda12x==12.1.0 during installation, even though the environment is CUDA 11.x (installed with conda).
>
> Fixed by pip install cupy-cuda11x==12.1 and python -m cupyx.tools.install_library --library nccl --cuda 11.x.
>
> Frustratingly, simply running pip install cupy-cuda11x==12.1 was not enough on its own; I had to uninstall cupy first and reinstall it, and then it worked.

It worked for me as well. Thanks!

Beomi commented 8 months ago

I successfully made it work with these commands:

export VLLM_VERSION=0.3.3
export PYTHON_VERSION=39

pip install https://github.com/vllm-project/vllm/releases/download/v$VLLM_VERSION/vllm-$VLLM_VERSION+cu118-cp$PYTHON_VERSION-cp$PYTHON_VERSION-manylinux1_x86_64.whl

pip uninstall xformers -y
pip install --upgrade xformers --index-url https://download.pytorch.org/whl/cu118

# VLLM 0.3.3 requires torch 2.1.2
pip uninstall torch -y
pip install torch==2.1.2 --upgrade --index-url https://download.pytorch.org/whl/cu118

pip uninstall cupy-cuda12x -y
pip install cupy-cuda11x==12.1
python -m cupyx.tools.install_library --library nccl --cuda 11.x

Although pip reports the dependency errors below, I can safely ignore them:

vllm 0.3.3+cu118 requires cupy-cuda12x==12.1.0, which is not installed.
vllm 0.3.3+cu118 requires xformers==0.0.23.post1, but you have xformers 0.0.24+cu118 which is incompatible.
xformers 0.0.24+cu118 requires torch==2.2.0, but you have torch 2.1.2+cu118 which is incompatible.
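A quick sanity check that the mixed versions still load together despite pip's complaints (just confirming the imports, nothing more):

python -c "import vllm, torch; print(vllm.__version__, torch.__version__, torch.version.cuda)"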

ifsheldon commented 8 months ago

I followed the above instructions, but python -m cupyx.tools.install_library --library nccl --cuda 11.x always failed for me because shared libraries could not be found. The nv* and cu* shared libraries are installed, but they are not on LD_LIBRARY_PATH, and I think that is because pip does not manage environment variables for you. So I decided to install most dependencies with mamba (a faster conda; plain conda works just as well). Here are my steps (collected into a single command sequence after the list):

  1. mamba create -n vllm python=3.10 -y: do NOT use python=3.11 for now, since cupy=12.1 does not support the slightly newer patch releases of Python 3.11 such as 3.11.1.
  2. mamba install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 cupy=12.1 cuda-version=12.1 -c pytorch -c nvidia -c conda-forge: this installs compatible torch and cupy together.
  3. pip install xformers=="0.0.23.post1" --index-url https://download.pytorch.org/whl/cu121: 0.0.23.post1 is the only xformers version compatible with torch=2.1.2.
  4. python -m cupyx.tools.install_library --library nccl --cuda 12.x
  5. Optionally pip install modelscope
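The same steps collected into one copy-pasteable sequence (mamba and conda are interchangeable here; the activation step is only implied by the list above):

mamba create -n vllm python=3.10 -y
mamba activate vllm          # or: conda activate vllm
mamba install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 cupy=12.1 cuda-version=12.1 -c pytorch -c nvidia -c conda-forge
pip install xformers=="0.0.23.post1" --index-url https://download.pytorch.org/whl/cu121
python -m cupyx.tools.install_library --library nccl --cuda 12.x
pip install modelscope       # optional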

Good luck in python deps hell

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!