Perhaps I am missing something, but from _custom_ops.py it seems the new implementation still relies on vllm._C being available.
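For context, this failure mode matches a guarded import whose fallback silently leaves the extension name unbound. Below is a minimal, hypothetical sketch of that pattern; the names vllm_cache_ops and reshape_and_cache come from the traceback further down, but the rest is illustrative, not the actual _custom_ops.py source:

try:
    from vllm._C import cache_ops as vllm_cache_ops
except ImportError:
    pass  # error swallowed; vllm_cache_ops is never bound

def reshape_and_cache(key, value, key_cache, value_cache, *rest):
    # If the import above failed, this raises
    # NameError: name 'vllm_cache_ops' is not defined
    vllm_cache_ops.reshape_and_cache(key, value, key_cache, value_cache, *rest)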
I observed the same bug in the CPU Docker image as well. The bug is quite peculiar: I accidentally bypassed it by starting Python with an incorrect command:
python3 -i 'from vllm import LLM, SamplingParams;llm = LLM(model="facebook/opt-125m");print(llm.generate("Hi"))'
The incorrect path after -i is crucial; if I shorten it, the original error still occurs. I can reproduce this bug 100% consistently in my environment.
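For reference: python3 -i enters interactive mode after handling the file argument, and it still drops to the >>> prompt even when the file cannot be opened, which is why the bogus argument lands you in a REPL (the log below shows the same behavior). A minimal example with a made-up filename:

$ python3 -i not_a_real_file
python3: can't open file '/workspace/vllm/not_a_real_file': [Errno 2] No such file or directory
>>>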
Here is the log showing the bug being reproduced and then bypassed.
root@bj:/workspace/vllm# python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from vllm import LLM, SamplingParams;llm = LLM(model="facebook/opt-125m");print(llm.generate("Hi"))
INFO 04-16 18:31:00 pynccl_utils.py:17] Failed to import NCCL library: NCCL only supports CUDA and ROCm backends.
INFO 04-16 18:31:00 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
WARNING 04-16 18:31:00 ray_utils.py:76] Failed to import Ray with ModuleNotFoundError("No module named 'ray'"). For distributed inference, please install Ray with `pip install ray`.
INFO 04-16 18:31:01 llm_engine.py:84] Initializing an LLM engine (v0.4.0.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 04-16 18:31:01 cpu_executor.py:102] float16 is not supported on CPU, casting to bfloat16.
WARNING 04-16 18:31:01 cpu_executor.py:105] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 04-16 18:31:01 cpu_executor.py:133] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 04-16 18:31:01 selector.py:43] Using Torch SDPA backend.
INFO 04-16 18:31:02 weight_utils.py:197] Using model weights format ['*.bin']
INFO 04-16 18:31:03 cpu_executor.py:69] # CPU blocks: 7281
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/workspace/vllm/vllm/entrypoints/llm.py", line 194, in generate
return self._run_engine(use_tqdm)
File "/workspace/vllm/vllm/entrypoints/llm.py", line 222, in _run_engine
step_outputs = self.llm_engine.step()
File "/workspace/vllm/vllm/engine/llm_engine.py", line 726, in step
output = self.model_executor.execute_model(
File "/workspace/vllm/vllm/executor/cpu_executor.py", line 77, in execute_model
output = self.driver_worker.execute_model(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/vllm/vllm/worker/cpu_worker.py", line 276, in execute_model
output = self.model_runner.execute_model(seq_group_metadata_list,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/vllm/vllm/worker/cpu_model_runner.py", line 394, in execute_model
hidden_states = model_executable(**execute_model_kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/vllm/vllm/model_executor/models/opt.py", line 300, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/vllm/vllm/model_executor/models/opt.py", line 275, in forward
return self.decoder(input_ids, positions, kv_caches, attn_metadata)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/vllm/vllm/model_executor/models/opt.py", line 249, in forward
hidden_states = layer(hidden_states, kv_caches[i], attn_metadata)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/vllm/vllm/model_executor/models/opt.py", line 157, in forward
hidden_states = self.self_attn(hidden_states=hidden_states,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/vllm/vllm/model_executor/models/opt.py", line 101, in forward
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/vllm/vllm/attention/layer.py", line 48, in forward
return self.impl.forward(query, key, value, kv_cache, attn_metadata,
File "/workspace/vllm/vllm/attention/backends/torch_sdpa.py", line 132, in forward
PagedAttention.write_to_paged_cache(key, value, key_cache,
File "/workspace/vllm/vllm/attention/ops/paged_attn.py", line 72, in write_to_paged_cache
ops.reshape_and_cache(
File "/workspace/vllm/vllm/_custom_ops.py", line 175, in reshape_and_cache
vllm_cache_ops.reshape_and_cache(key, value, key_cache, value_cache,
NameError: name 'vllm_cache_ops' is not defined
>>> exit()
Processed prompts: 0%| | 0/1 [00:05<?, ?it/s]
root@bj:/workspace/vllm# python3 -i 'from vllm import LLM, SamplingParams;llm = LLM(model="facebook/opt-125m");print(llm.generate("Hi"))'
python3: can't open file '/workspace/vllm/from vllm import LLM, SamplingParams;llm = LLM(model="facebook/opt-125m");print(llm.generate("Hi"))': [Errno 2] No such file or directory
>>> from vllm import LLM, SamplingParams;llm = LLM(model="facebook/opt-125m");print(llm.generate("Hi"))
INFO 04-16 18:31:20 pynccl_utils.py:17] Failed to import NCCL library: NCCL only supports CUDA and ROCm backends.
INFO 04-16 18:31:20 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
WARNING 04-16 18:31:20 ray_utils.py:76] Failed to import Ray with ModuleNotFoundError("No module named 'ray'"). For distributed inference, please install Ray with `pip install ray`.
INFO 04-16 18:31:20 llm_engine.py:84] Initializing an LLM engine (v0.4.0.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 04-16 18:31:21 cpu_executor.py:102] float16 is not supported on CPU, casting to bfloat16.
WARNING 04-16 18:31:21 cpu_executor.py:105] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 04-16 18:31:21 cpu_executor.py:133] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 04-16 18:31:21 selector.py:43] Using Torch SDPA backend.
INFO 04-16 18:31:22 weight_utils.py:197] Using model weights format ['*.bin']
INFO 04-16 18:31:22 cpu_executor.py:69] # CPU blocks: 7281
Processed prompts: 100%|███████████████████████████████████████████████| 1/1 [00:01<00:00, 1.40s/it]
[RequestOutput(request_id=0, prompt='Hi', prompt_token_ids=[2, 30086], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='-Re!\nCool, congratulations!', token_ids=[12, 9064, 328, 50118, 37739, 6, 24285, 328, 2], cumulative_logprob=-34.01702481508255, logprobs=None, finish_reason=stop, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1713292283.3820164, last_token_time=1713292283.3820164, first_scheduled_time=1713292283.3846822, first_token_time=1713292283.4282758, time_in_queue=0.0026657581329345703, finished_time=1713292284.7867177), lora_request=None)]
>>>
This does work for me, but why?
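A hypothetical diagnostic (not from the thread) that would help distinguish the two sessions: check which vllm actually gets imported and whether the compiled extension is present in each case:

>>> import importlib.util, vllm
>>> vllm.__file__  # local source tree vs. installed site-packages
>>> importlib.util.find_spec("vllm._C") is None  # True means the compiled extension is missing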
I found that running VLLM_TARGET_DEVICE=cpu python setup.py develop solves this issue.
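Assuming the CPU build also produces the vllm._C extension (consistent with the error quoted below), a quick way to verify the build took effect:

root@bj:/workspace/vllm# VLLM_TARGET_DEVICE=cpu python setup.py develop
root@bj:/workspace/vllm# python3 -c "import vllm._C; print('vllm._C OK')"
vllm._C OK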
no module named vllm._C
I came across the same kind of bug today. I finally tracked down the failing reason and fixed it simply by dumping the error message: https://github.com/vllm-project/vllm/pull/5282 .
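In the same spirit, a minimal sketch of that technique (not the actual diff from the PR): instead of swallowing the ImportError, log it so the real failure reason is visible up front:

import logging

logger = logging.getLogger(__name__)

try:
    from vllm._C import cache_ops as vllm_cache_ops
except ImportError as e:
    # Surfacing the original error turns a mysterious NameError later on
    # into an actionable message (e.g. a missing or mismatched build).
    logger.warning("Failed to import from vllm._C: %s", e)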
The previous fix from https://github.com/vllm-project/vllm/pull/3913 did not seem to work; the same issue is still encountered.