vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Failed to import from vllm._C with ImportError("/lib64/libc.so.6: version `GLIBC_2.32' not found #6562

Closed by balcklive 2 months ago

balcklive commented 2 months ago

Your current environment

```text
(vllm311) [root@instance-bg8ds9yc pengfei]# python vllm/collect_env.py
Collecting environment information...
WARNING 07-19 14:45:53 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.9.2009 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.17

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.102.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB

Nvidia driver version: 530.30.02
cuDNN version: /root/miniconda3/envs/wizardcoder34/lib/python3.10/site-packages/nvidia/cudnn/lib/libcudnn.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
Stepping:              7
CPU MHz:               2600.000
BogoMIPS:              5200.00
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              33792K
NUMA node0 CPU(s):     0-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq monitor vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat umip pku ospke avx512_vnni md_clear spec_ctrl intel_stibp arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.42.4
[pip3] triton==2.3.1
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] torch                     2.3.1                    pypi_0    pypi
[conda] torchvision               0.18.1                   pypi_0    pypi
[conda] transformers              4.42.4                   pypi_0    pypi
[conda] triton                    2.3.1                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV2     0-15            N/A
GPU1    NV2      X      0-15            N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

I installed vllm with this command: `pip install vllm`

but I got this error when I imported it:

```text
Python 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import vllm
WARNING 07-19 14:34:32 _custom_ops.py:14] Failed to import from vllm._C with ImportError("/lib64/libc.so.6: version `GLIBC_2.32' not found (required by /root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/_C.abi3.so)")
```
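
The environment details above show glibc 2.17 (CentOS 7), while the prebuilt `_C.abi3.so` requires `GLIBC_2.32`, so the system libc is older than what the wheel was built against. A minimal sketch for confirming such a mismatch, using the paths from the log above (requires `binutils` for `objdump`):

```bash
# Sketch: compare the system glibc with the GLIBC versions the extension requires.
ldd --version | head -n 1    # system glibc; CentOS 7 ships 2.17

# List the GLIBC symbol versions referenced by the prebuilt extension.
objdump -T /root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/_C.abi3.so \
  | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail -n 5
```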

When I run:

```text
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-7B-Instruct --dtype half
```

I got this:

```text
WARNING 07-19 14:50:28 _custom_ops.py:14] Failed to import from vllm._C with ImportError("/lib64/libc.so.6: version `GLIBC_2.32' not found (required by /root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/_C.abi3.so)")
INFO 07-19 14:50:30 api_server.py:212] vLLM API server version 0.5.2
INFO 07-19 14:50:30 api_server.py:213] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='Qwen/Qwen2-7B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 07-19 14:50:30 config.py:1378] Casting torch.bfloat16 to torch.float16.
INFO 07-19 14:50:30 llm_engine.py:174] Initializing an LLM engine (v0.5.2) with config: model='Qwen/Qwen2-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=Qwen/Qwen2-7B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-19 14:50:31 selector.py:150] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-19 14:50:31 selector.py:53] Using XFormers backend.
INFO 07-19 14:50:32 selector.py:150] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-19 14:50:32 selector.py:53] Using XFormers backend.
INFO 07-19 14:50:33 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 07-19 14:50:40 model_runner.py:266] Loading model weights took 14.2487 GB
ERROR 07-19 14:50:40 _custom_ops.py:42] Error in calling custom op rms_norm: '_OpNamespace' '_C' object has no attribute 'rms_norm'
ERROR 07-19 14:50:40 _custom_ops.py:42] Possibly you have built or installed an obsolete version of vllm.
ERROR 07-19 14:50:40 _custom_ops.py:42] Please try a clean build and install of vllm,or remove old built files such as vllm/*cpython*.so and build/ .
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 282, in <module>
[rank0]:     run_server(args)
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 224, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 444, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 373, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 520, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 263, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 362, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 78, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 923, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1341, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:                                     ^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 336, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 257, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:                               ^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 205, in forward
[rank0]:     hidden_states = self.input_layernorm(hidden_states)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/model_executor/custom_op.py", line 13, in forward
[rank0]:     return self._forward_method(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/model_executor/layers/layernorm.py", line 62, in forward_cuda
[rank0]:     ops.rms_norm(
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/_custom_ops.py", line 43, in wrapper
[rank0]:     raise e
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/_custom_ops.py", line 34, in wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/vllm/_custom_ops.py", line 158, in rms_norm
[rank0]:     torch.ops._C.rms_norm(out, input, weight, epsilon)
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm311/lib/python3.11/site-packages/torch/_ops.py", line 921, in __getattr__
[rank0]:     raise AttributeError(
[rank0]: AttributeError: '_OpNamespace' '_C' object has no attribute 'rms_norm'
```

youkaichao commented 2 months ago

Can you try https://github.com/vllm-project/vllm/pull/6517? You can install the per-commit wheel built by that PR; a guide is available at https://docs.vllm.ai/en/latest/getting_started/installation.html.
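
For reference, a rough sketch of that workflow, assuming the wheel built by that PR's CI has already been downloaded into the current directory (the installation guide linked above describes where to get it):

```bash
# Sketch only: swap the broken install for a locally downloaded per-commit wheel.
pip uninstall -y vllm
pip install ./vllm-*-cp311-cp311-manylinux1_x86_64.whl   # assumed local wheel filename

# Smoke test: the import should no longer warn about vllm._C or GLIBC.
python -c "import vllm; print(vllm.__version__)"
```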

balcklive commented 2 months ago

> Can you try #6517? You can install the per-commit wheel built by that PR; a guide is available at https://docs.vllm.ai/en/latest/getting_started/installation.html.

I downloaded the release asset vllm-0.5.2-cp311-cp311-manylinux1_x86_64.whl and installed it with `pip install vllm-0.5.2-cp311-cp311-manylinux1_x86_64.whl`, and finally it works!

My next question is: how can I compile such a .whl file myself? Is this the right command?
`python setup.py bdist_wheel`

youkaichao commented 2 months ago

if you are curious: https://github.com/vllm-project/vllm/blob/main/.github/workflows/publish.yml
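
For a rough local equivalent, here is a sketch of building the wheel from source; the arch list and the `build` frontend below are illustrative assumptions rather than what publish.yml itself uses, and a CUDA toolkit compatible with the installed PyTorch is required:

```bash
# Sketch of a local wheel build; the official wheels are produced by publish.yml.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -U build cmake ninja packaging setuptools wheel

# Assumption: build only for the GPUs at hand to cut compile time (V100 = compute 7.0).
export TORCH_CUDA_ARCH_LIST="7.0"

python -m build --wheel --no-isolation   # the wheel lands in dist/
pip install dist/vllm-*.whl
```

Your `python setup.py bdist_wheel` follows the same idea; either way, most of the time goes into compiling the CUDA kernels, so limiting `TORCH_CUDA_ARCH_LIST` helps.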

balcklive commented 2 months ago

> if you are curious: https://github.com/vllm-project/vllm/blob/main/.github/workflows/publish.yml

Thank you, I will give it a shot. By the way, this issue will be closed.