vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: vllm_C is missing. #4083

Closed Calvinnncy97 closed 3 months ago

Calvinnncy97 commented 5 months ago

Your current environment

The previous fix from https://github.com/vllm-project/vllm/pull/3913 does not seem to work; I am still encountering the same issue.

Collecting environment information...
INFO 04-15 07:13:37 pynccl.py:58] Loading nccl from library /home/me/.config/vllm/nccl/cu12/libnccl.so.2.18.1
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: 11.0.1-2
CMake version: version 3.29.2
Libc version: glibc-2.31

Python version: 3.9.2 (default, Feb 28 2021, 17:03:44)  [GCC 10.2.1 20210110] (64-bit runtime)
Python platform: Linux-5.16.0-0.bpo.4-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.54.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          256
On-line CPU(s) list:             0-255
Thread(s) per core:              2
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    8
Vendor ID:                       AuthenticAMD
CPU family:                      25
Model:                           1
Model name:                      AMD EPYC 7763 64-Core Processor
Stepping:                        1
CPU MHz:                         2381.263
BogoMIPS:                        4890.70
Virtualization:                  AMD-V
L1d cache:                       4 MiB
L1i cache:                       4 MiB
L2 cache:                        64 MiB
L3 cache:                        512 MiB
NUMA node0 CPU(s):               0-15,128-143
NUMA node1 CPU(s):               16-31,144-159
NUMA node2 CPU(s):               32-47,160-175
NUMA node3 CPU(s):               48-63,176-191
NUMA node4 CPU(s):               64-79,192-207
NUMA node5 CPU(s):               80-95,208-223
NUMA node6 CPU(s):               96-111,224-239
NUMA node7 CPU(s):               112-127,240-255
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV12    NV12    NV12    NV12    NV12    NV12    NV12    SYS PXB SYS SYS 48-63,176-191   3       N/A
GPU1    NV12     X  NV12    NV12    NV12    NV12    NV12    NV12    SYS PXB SYS SYS 48-63,176-191   3       N/A
GPU2    NV12    NV12     X  NV12    NV12    NV12    NV12    NV12    PXB SYS SYS SYS 16-31,144-159   1       N/A
GPU3    NV12    NV12    NV12     X  NV12    NV12    NV12    NV12    PXB SYS SYS SYS 16-31,144-159   1       N/A
GPU4    NV12    NV12    NV12    NV12     X  NV12    NV12    NV12    SYS SYS SYS PXB 112-127,240-255 7       N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X  NV12    NV12    SYS SYS SYS PXB 112-127,240-255 7       N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X  NV12    SYS SYS PXB SYS 80-95,208-223   5       N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X  SYS SYS PXB SYS 80-95,208-223   5       N/A
NIC0    SYS SYS PXB PXB SYS SYS SYS SYS  X  SYS SYS SYS             
NIC1    PXB PXB SYS SYS SYS SYS SYS SYS SYS  X  SYS SYS             
NIC2    SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS  X  SYS             
NIC3    SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS  X              

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3

🐛 Describe the bug


--model facebook/opt-125m
INFO 04-15 07:11:52 pynccl.py:58] Loading nccl from library /home/team/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 04-15 07:11:53 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-15 07:11:53 api_server.py:150] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='facebook/opt-125m', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, tensorizer_uri=None, verify_hash=False, encryption_keyfile=None, num_readers=1, s3_access_key_id=None, s3_secret_access_key=None, s3_endpoint=None, vllm_tensorized=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 212kB/s]
INFO 04-15 07:11:53 llm_engine.py:82] Initializing an LLM engine (v0.4.0.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, seed=0)
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 269kB/s]
vocab.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 3.76MB/s]
merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 954kB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 564kB/s]
INFO 04-15 07:11:59 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 04-15 07:11:59 selector.py:33] Using XFormers backend.
INFO 04-15 07:12:00 weight_utils.py:197] Using model weights format ['*.bin']
pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 251M/251M [00:00<00:00, 385MB/s]
INFO 04-15 07:12:02 model_runner.py:169] Loading model weights took 0.2389 GB
INFO 04-15 07:12:02 gpu_executor.py:80] # GPU blocks: 127977, # CPU blocks: 7281
INFO 04-15 07:12:04 model_runner.py:967] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-15 07:12:04 model_runner.py:971] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/team/calvinn/vllm/vllm/entrypoints/openai/api_server.py", line 157, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/home/team/calvinn/vllm/vllm/engine/async_llm_engine.py", line 347, in from_engine_args
    engine = cls(
  File "/home/team/calvinn/vllm/vllm/engine/async_llm_engine.py", line 311, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm/engine/async_llm_engine.py", line 421, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm/engine/llm_engine.py", line 133, in __init__
    self._initialize_kv_caches()
  File "/home/team/calvinn/vllm/vllm/engine/llm_engine.py", line 204, in _initialize_kv_caches
    self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/home/team/calvinn/vllm/vllm/executor/gpu_executor.py", line 83, in initialize_cache
    self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/home/team/calvinn/vllm/vllm/worker/worker.py", line 175, in initialize_cache
    self._warm_up_model()
  File "/home/team/calvinn/vllm/vllm/worker/worker.py", line 186, in _warm_up_model
    self.model_runner.capture_model(self.gpu_cache)
  File "/home/team/calvinn/vllm/vllm-venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm/worker/model_runner.py", line 1035, in capture_model
    graph_runner.capture(
  File "/home/team/calvinn/vllm/vllm/worker/model_runner.py", line 1087, in capture
    self.model(
  File "/home/team/calvinn/vllm/vllm-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm/model_executor/models/opt.py", line 300, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/home/team/calvinn/vllm/vllm-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm/model_executor/models/opt.py", line 275, in forward
    return self.decoder(input_ids, positions, kv_caches, attn_metadata)
  File "/home/team/calvinn/vllm/vllm-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm/model_executor/models/opt.py", line 249, in forward
    hidden_states = layer(hidden_states, kv_caches[i], attn_metadata)
  File "/home/team/calvinn/vllm/vllm-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm/model_executor/models/opt.py", line 157, in forward
    hidden_states = self.self_attn(hidden_states=hidden_states,
  File "/home/team/calvinn/vllm/vllm-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm/model_executor/models/opt.py", line 101, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/home/team/calvinn/vllm/vllm-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/team/calvinn/vllm/vllm/attention/layer.py", line 48, in forward
    return self.impl.forward(query, key, value, kv_cache, attn_metadata,
  File "/home/team/calvinn/vllm/vllm/attention/backends/xformers.py", line 200, in forward
    PagedAttention.write_to_paged_cache(key, value, key_cache,
  File "/home/team/calvinn/vllm/vllm/attention/ops/paged_attn.py", line 72, in write_to_paged_cache
    ops.reshape_and_cache(
  File "/home/team/calvinn/vllm/vllm/_custom_ops.py", line 175, in reshape_and_cache
    vllm_cache_ops.reshape_and_cache(key, value, key_cache, value_cache,
NameError: name 'vllm_cache_ops' is not defined
Calvinnncy97 commented 5 months ago

Perhaps I am missing something, but from _custom_ops.py it seems the new implementation still relies on vllm._C being available.
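
For context, this failure mode is consistent with a guarded import: if the compiled extension vllm._C cannot be imported, the module-level alias is simply never bound, and the problem only surfaces much later as a NameError at call time. A minimal sketch of that pattern (illustrative only, not the exact vLLM source; the wrapper signature is deliberately generic):

```python
# Sketch of a guarded-import pattern that reproduces the observed NameError.
# The module/alias names mirror the traceback; everything else is illustrative.
try:
    # Fails if the vllm._C extension was never built or installed for this device.
    from vllm._C import cache_ops as vllm_cache_ops
except ImportError:
    # Swallowing the error here hides the real cause of the failure.
    pass

def reshape_and_cache(*args, **kwargs):
    # If the import above failed, this raises
    # "NameError: name 'vllm_cache_ops' is not defined" at call time.
    return vllm_cache_ops.reshape_and_cache(*args, **kwargs)
```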

DIYer22 commented 5 months ago

I observed the same bug in the CPU-version Docker image as well. This bug is quite peculiar, and I accidentally bypassed it by entering Python with an incorrect command:

python3 -i 'from vllm import LLM, SamplingParams;llm = LLM(model="facebook/opt-125m");print(llm.generate("Hi"))'

The incorrect path after -i is crucial; shortening it still causes the error. I can reproduce this bug 100% consistently in my environment.

Here is the log showing the bug being reproduced and then bypassed.

root@bj:/workspace/vllm# python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from vllm import LLM, SamplingParams;llm = LLM(model="facebook/opt-125m");print(llm.generate("Hi"))
INFO 04-16 18:31:00 pynccl_utils.py:17] Failed to import NCCL library: NCCL only supports CUDA and ROCm backends.
INFO 04-16 18:31:00 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
WARNING 04-16 18:31:00 ray_utils.py:76] Failed to import Ray with ModuleNotFoundError("No module named 'ray'"). For distributed inference, please install Ray with `pip install ray`.
INFO 04-16 18:31:01 llm_engine.py:84] Initializing an LLM engine (v0.4.0.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 04-16 18:31:01 cpu_executor.py:102] float16 is not supported on CPU, casting to bfloat16.
WARNING 04-16 18:31:01 cpu_executor.py:105] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 04-16 18:31:01 cpu_executor.py:133] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 04-16 18:31:01 selector.py:43] Using Torch SDPA backend.
INFO 04-16 18:31:02 weight_utils.py:197] Using model weights format ['*.bin']
INFO 04-16 18:31:03 cpu_executor.py:69] # CPU blocks: 7281
Processed prompts:   0%|                                                       | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspace/vllm/vllm/entrypoints/llm.py", line 194, in generate
    return self._run_engine(use_tqdm)
  File "/workspace/vllm/vllm/entrypoints/llm.py", line 222, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/workspace/vllm/vllm/engine/llm_engine.py", line 726, in step
    output = self.model_executor.execute_model(
  File "/workspace/vllm/vllm/executor/cpu_executor.py", line 77, in execute_model
    output = self.driver_worker.execute_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/vllm/worker/cpu_worker.py", line 276, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/vllm/worker/cpu_model_runner.py", line 394, in execute_model
    hidden_states = model_executable(**execute_model_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/vllm/model_executor/models/opt.py", line 300, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/vllm/model_executor/models/opt.py", line 275, in forward
    return self.decoder(input_ids, positions, kv_caches, attn_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/vllm/model_executor/models/opt.py", line 249, in forward
    hidden_states = layer(hidden_states, kv_caches[i], attn_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/vllm/model_executor/models/opt.py", line 157, in forward
    hidden_states = self.self_attn(hidden_states=hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/vllm/model_executor/models/opt.py", line 101, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/vllm/attention/layer.py", line 48, in forward
    return self.impl.forward(query, key, value, kv_cache, attn_metadata,
  File "/workspace/vllm/vllm/attention/backends/torch_sdpa.py", line 132, in forward
    PagedAttention.write_to_paged_cache(key, value, key_cache,
  File "/workspace/vllm/vllm/attention/ops/paged_attn.py", line 72, in write_to_paged_cache
    ops.reshape_and_cache(
  File "/workspace/vllm/vllm/_custom_ops.py", line 175, in reshape_and_cache
    vllm_cache_ops.reshape_and_cache(key, value, key_cache, value_cache,
NameError: name 'vllm_cache_ops' is not defined
>>> exit()
Processed prompts:   0%|                                                       | 0/1 [00:05<?, ?it/s]

root@bj:/workspace/vllm# python3 -i 'from vllm import LLM, SamplingParams;llm = LLM(model="facebook/opt-125m");print(llm.generate("Hi"))'
python3: can't open file '/workspace/vllm/from vllm import LLM, SamplingParams;llm = LLM(model="facebook/opt-125m");print(llm.generate("Hi"))': [Errno 2] No such file or directory
>>> from vllm import LLM, SamplingParams;llm = LLM(model="facebook/opt-125m");print(llm.generate("Hi"))
INFO 04-16 18:31:20 pynccl_utils.py:17] Failed to import NCCL library: NCCL only supports CUDA and ROCm backends.
INFO 04-16 18:31:20 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
WARNING 04-16 18:31:20 ray_utils.py:76] Failed to import Ray with ModuleNotFoundError("No module named 'ray'"). For distributed inference, please install Ray with `pip install ray`.
INFO 04-16 18:31:20 llm_engine.py:84] Initializing an LLM engine (v0.4.0.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 04-16 18:31:21 cpu_executor.py:102] float16 is not supported on CPU, casting to bfloat16.
WARNING 04-16 18:31:21 cpu_executor.py:105] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 04-16 18:31:21 cpu_executor.py:133] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 04-16 18:31:21 selector.py:43] Using Torch SDPA backend.
INFO 04-16 18:31:22 weight_utils.py:197] Using model weights format ['*.bin']
INFO 04-16 18:31:22 cpu_executor.py:69] # CPU blocks: 7281
Processed prompts: 100%|███████████████████████████████████████████████| 1/1 [00:01<00:00,  1.40s/it]
[RequestOutput(request_id=0, prompt='Hi', prompt_token_ids=[2, 30086], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='-Re!\nCool, congratulations!', token_ids=[12, 9064, 328, 50118, 37739, 6, 24285, 328, 2], cumulative_logprob=-34.01702481508255, logprobs=None, finish_reason=stop, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1713292283.3820164, last_token_time=1713292283.3820164, first_scheduled_time=1713292283.3846822, first_token_time=1713292283.4282758, time_in_queue=0.0026657581329345703, finished_time=1713292284.7867177), lora_request=None)]
>>> 
leng-yue commented 5 months ago

> I observed the same bug in the CPU version Docker as well. This bug is quite peculiar, and I accidentally bypassed it by entering Python with an incorrect command: python3 -i '...'
> (same reproduction log as in the previous comment)

This does work for me, but why?

leng-yue commented 5 months ago

I found that running VLLM_TARGET_DEVICE=cpu python setup.py develop solves this issue.
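
After rebuilding, a quick way to confirm that the compiled extension is actually importable (a minimal, self-contained check; vllm._C is the extension module named in the tracebacks above):

```python
# Sanity check: can vllm and its compiled extension be imported at all?
# If vllm._C fails here, the NameError seen above is expected downstream.
import importlib

for mod in ("vllm", "vllm._C"):
    try:
        m = importlib.import_module(mod)
        print(f"OK: imported {mod} from {getattr(m, '__file__', '<extension>')}")
    except ImportError as exc:
        print(f"FAILED to import {mod}: {exc}")
```

Printing __file__ also shows whether import vllm is resolving to the local source checkout (e.g. under /workspace/vllm) rather than an installed package, which is a common way to end up without the compiled _C extension.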

eigen2017 commented 4 months ago

no module named vllm._C

DamonFool commented 4 months ago

I came across the same kind of bug today. I eventually found the root cause and fixed it simply by dumping the error message: https://github.com/vllm-project/vllm/pull/5282.
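
In other words, the fix is to stop swallowing the import failure silently and surface the underlying reason. A minimal sketch of the idea (the logger name is illustrative, not the exact patch from that PR):

```python
import logging

logger = logging.getLogger("vllm")  # illustrative logger name for this sketch

try:
    from vllm._C import cache_ops as vllm_cache_ops
except ImportError as e:
    # Dump the real reason (missing .so, wrong build target, ABI mismatch, ...)
    # instead of letting it resurface later as an unrelated NameError.
    logger.warning("Failed to import vllm._C: %s", e)
```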