Closed Zoro528 closed 1 month ago
I have the same issue. How can it be fixed?
Hi @Zoro528 and @1315577677, thanks for your feedback. This is fixed in #44; you can upgrade MInference to get the fix:
pip install minference -U
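As a quick sanity check after running the upgrade command, you can read the installed version from the environment with the standard library. This is just an illustrative sketch; `minference` is simply the package name from the pip command above, and any version newer than 0.1.4.post3 (the release this bug was reported against) should include the fix.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(pkg: str):
    """Return the installed version string of pkg, or None if it is absent."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None


# After `pip install minference -U`, this should print something
# newer than "0.1.4.post3"; None means the package is not installed.
print(installed_version("minference"))
```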
Describe the bug
vllm==0.4.3
minference==0.1.4.post3
flash-attn==2.5.9.post1
triton==2.3.0
run_vllm.py from https://github.com/microsoft/MInference/blob/main/examples/run_vllm.py
Steps to reproduce
No response
Expected Behavior
No response
Logs
Patched model for minference with vLLM..
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, Generation Speed: 0.00 toks/s]
rank0: Traceback (most recent call last):
rank0:   File "/home/code/rag_benchmark/vllm_benchmark/run_vllm.py", line 32, in <module>
rank0:     outputs = llm.generate(prompts, sampling_params)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/utils.py", line 672, in inner
rank0:     return fn(*args, **kwargs)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 300, in generate
rank0:     outputs = self._run_engine(use_tqdm=use_tqdm)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 552, in _run_engine
rank0:     step_outputs = self.llm_engine.step()
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 773, in step
rank0:     output = self.model_executor.execute_model(
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 91, in execute_model
rank0:     output = self.driver_worker.execute_model(execute_model_req)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/worker/worker.py", line 272, in execute_model
rank0:     output = self.model_runner.execute_model(seq_group_metadata_list,
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 728, in execute_model
rank0:     hidden_states = model_executable(**execute_model_kwargs)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 330, in forward
rank0:     hidden_states = self.model(input_ids, positions, kv_caches,
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 254, in forward
rank0:     hidden_states, residual = layer(
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 206, in forward
rank0:     hidden_states = self.self_attn(
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 153, in forward
rank0:     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/code/github/MInference/minference/patch.py", line 1085, in vllm_attn_forward
rank0:     return self.impl.forward(
rank0:   File "/home/code/github/MInference/minference/modules/minference_forward.py", line 1189, in forward_vllm_043
rank0: NameError: name 'cache_ops' is not defined
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, Generation Speed: 0.00 toks/s]
Additional Information
No response