vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: AttributeError: '_OpNamespace' '_C' object has no attribute 'rms_norm' #5804

Closed: mikestut closed this issue 2 months ago

mikestut commented 2 months ago

Your current environment

When I run this command line:

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-1.5B-Instruct --served-model-name Qwen2-7B-Instruct-lora --max-model-len=2048 --dtype=half

I get this response:

WARNING 06-24 21:44:48 _custom_ops.py:14] Failed to import from vllm._C with ImportError('libcudart.so.12: cannot open shared object file: No such file or directory')
INFO 06-24 21:44:54 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 06-24 21:44:54 api_server.py:178] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='Qwen/Qwen2-1.5B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=['Qwen2-7B-Instruct-lora'], qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 06-24 21:44:54 config.py:1222] Casting torch.bfloat16 to torch.float16.
INFO 06-24 21:44:54 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='Qwen/Qwen2-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=Qwen2-7B-Instruct-lora)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-24 21:44:55 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-24 21:44:55 selector.py:51] Using XFormers backend.
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.3.0+cu121 with CUDA 1201 (you have 2.3.0+cu118)
    Python 3.10.14 (you have 3.10.9)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details
INFO 06-24 21:44:56 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-24 21:44:56 selector.py:51] Using XFormers backend.
INFO 06-24 21:45:26 model_runner.py:160] Loading model weights took 2.8875 GB
ERROR 06-24 21:45:26 _custom_ops.py:42] Error in calling custom op rms_norm: '_OpNamespace' '_C' object has no attribute 'rms_norm'
ERROR 06-24 21:45:26 _custom_ops.py:42] Possibly you have built or installed an obsolete version of vllm.
ERROR 06-24 21:45:26 _custom_ops.py:42] Please try a clean build and install of vllm, or remove old built files such as vllm/*cpython*.so and build/ .
rank0: Traceback (most recent call last):
rank0:   File "/root/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
rank0:     return _run_code(code, main_globals, None,
rank0:   File "/root/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
rank0:     exec(code, run_globals)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
rank0:     engine = AsyncLLMEngine.from_engine_args(
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
rank0:     engine = cls(
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
rank0:     self.engine = self._init_engine(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
rank0:     return engine_class(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 236, in __init__
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 75, in determine_num_available_blocks
rank0:     return self.driver_worker.determine_num_available_blocks()
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
rank0:     self.execute_model(seqs, kv_caches)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 749, in execute_model
rank0:     hidden_states = model_executable(
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 330, in forward
rank0:     hidden_states = self.model(input_ids, positions, kv_caches,
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 254, in forward
rank0:     hidden_states, residual = layer(
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 202, in forward
rank0:     hidden_states = self.input_layernorm(hidden_states)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/custom_op.py", line 13, in forward
rank0:     return self._forward_method(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/layernorm.py", line 62, in forward_cuda
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/_custom_ops.py", line 43, in wrapper
rank0:     raise e
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/_custom_ops.py", line 34, in wrapper
rank0:     return fn(*args, **kwargs)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/vllm/_custom_ops.py", line 154, in rms_norm
rank0:     torch.ops._C.rms_norm(out, input, weight, epsilon)
rank0:   File "/root/miniconda3/lib/python3.10/site-packages/torch/_ops.py", line 921, in __getattr__
rank0:     raise AttributeError(
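The first warning above is the root cause: the installed vllm binary links against libcudart.so.12 (the CUDA 12 runtime), while the xFormers warning shows this environment actually runs PyTorch 2.3.0+cu118 (CUDA 11.8). Because vllm._C never imports, none of vllm's compiled kernels, including rms_norm, get registered on torch.ops._C, and the later AttributeError follows. The mismatch can be confirmed directly; a minimal sketch (the site-packages path is taken from the traceback above, and the compiled extension's filename varies by Python version and platform):

# Which CUDA build is PyTorch using? (expect 2.3.0+cu118 / 11.8 here)
python -c "import torch; print(torch.__version__, torch.version.cuda)"

# Reproduce the failing import outside the server; on this machine it should
# raise ImportError: libcudart.so.12: cannot open shared object file
python -c "from vllm import _C"

# Which CUDA runtime does the compiled extension actually link against?
ldd /root/miniconda3/lib/python3.10/site-packages/vllm/_C*.so | grep cudart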

Environment:

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

nvidia-smi output: attached as a screenshot.
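nvcc confirms the system toolkit is CUDA 11.8, so nothing on this machine provides the CUDA 12 runtime that the default vllm wheel expects. That can be checked without involving vllm at all; a minimal sketch:

# Succeeds only if a CUDA 12 runtime is on the loader path; here it should
# fail with the same "cannot open shared object file" as vllm._C does.
python -c "import ctypes; ctypes.CDLL('libcudart.so.12')"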

How would you like to use vllm

I wanted to try: pip install rms_norm

But it does not work; I just get the same warning again: WARNING 06-24 21:44:48 _custom_ops.py:14] Failed to import from vllm._C with ImportError('libcudart.so.12: cannot open shared object file: No such file or directory')
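Note that rms_norm is not a PyPI package, so pip install rms_norm cannot fix this: it is a CUDA kernel compiled into vllm's own _C extension and exposed as torch.ops._C.rms_norm. While that extension fails to import, the op namespace stays empty. A quick check (a sketch; vllm._custom_ops is the module shown in the traceback, and it swallows the import failure, only logging the warning):

# Prints False while the compiled extension cannot load;
# a healthy install prints True.
python -c "import vllm._custom_ops, torch; print(hasattr(torch.ops._C, 'rms_norm'))"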

DarkLight1337 commented 2 months ago

Please follow the instructions in the error message:

ERROR 06-24 21:45:26 _custom_ops.py:42] Possibly you have built or installed an obsolete version of vllm.
ERROR 06-24 21:45:26 _custom_ops.py:42] Please try a clean build and install of vllm, or remove old built files such as vllm/*cpython*.so and build/ .
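Concretely, for this machine (CUDA 11.8, Python 3.10) a clean reinstall could look like the sketch below. The cu118 wheel URL pattern follows the vLLM installation docs of that era; treat it as an example only and confirm the exact wheel name for your version on the vllm-project/vllm GitHub releases page:

# Remove the current (CUDA 12) build; for source builds also delete stale
# artifacts such as build/ and vllm/*cpython*.so
pip uninstall -y vllm

# Install a wheel compiled against CUDA 11.8 to match torch 2.3.0+cu118.
export VLLM_VERSION=0.5.0.post1
export PYTHON_VERSION=310
pip install "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl" --extra-index-url https://download.pytorch.org/whl/cu118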