microsoft / MInference

To speed up long-context LLMs' inference, MInference computes attention with approximate, dynamic sparse patterns, which reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
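For intuition, here is a toy sketch of what dynamic block-sparse attention means: each query block attends only to the top-k key blocks, ranked by a cheap pooled relevance estimate. This is an illustration, not MInference's actual kernels; the shapes, block size, and `topk` are arbitrary, and causal masking and multi-head structure are ignored.

```python
# Toy illustration of dynamic block-sparse attention (NOT MInference's kernels):
# each query block attends to only the top-k key blocks, picked by a cheap
# mean-pooled relevance estimate. Causal masking and heads are omitted.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, topk=4):
    seq, dim = q.shape                      # assumes seq is divisible by block
    nb = seq // block
    qb, kb, vb = (x.view(nb, block, dim) for x in (q, k, v))
    # Rank key blocks per query block using mean-pooled representatives
    scores = qb.mean(dim=1) @ kb.mean(dim=1).T          # [nb, nb]
    keep = scores.topk(min(topk, nb), dim=-1).indices   # [nb, topk]
    out = torch.empty_like(qb)
    for i in range(nb):                     # dense attention over kept blocks only
        ks = kb[keep[i]].reshape(-1, dim)
        vs = vb[keep[i]].reshape(-1, dim)
        attn = F.softmax(qb[i] @ ks.T / dim ** 0.5, dim=-1)
        out[i] = attn @ vs
    return out.reshape(seq, dim)

q, k, v = (torch.randn(512, 64) for _ in range(3))
y = block_sparse_attention(q, k, v)  # each query block sees 4 of 8 key blocks
```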
MIT License

[Bug]: NameError: name 'cache_ops' is not defined #42

Closed Zoro528 closed 1 month ago

Zoro528 commented 1 month ago

Describe the bug

```
vllm==0.4.3
minference==0.1.4.post3
flash-attn==2.5.9.post1
triton==2.3.0
```

Running `run_vllm.py` from https://github.com/microsoft/MInference/blob/main/examples/run_vllm.py
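A minimal sketch of the failing path, following the linked `examples/run_vllm.py`; the model name and prompts below are placeholders, not the reporter's exact inputs:

```python
# Hedged sketch based on examples/run_vllm.py; model name and prompts are
# assumptions, not the reporter's actual inputs.
from vllm import LLM, SamplingParams
from minference import MInference

model_name = "Qwen/Qwen2-7B-Instruct"  # assumption: traceback goes through qwen2.py
llm = LLM(model_name, enforce_eager=True, max_model_len=128000)

# Patch the vLLM engine so pre-filling uses MInference's sparse attention
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

prompts = ["..."] * 4  # the log shows 4 prompts in flight
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(prompts, sampling_params)  # raises the NameError below
```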

Steps to reproduce

No response

Expected Behavior

No response

Logs

```
Patched model for minference with vLLM..
Processed prompts:   0%| | 0/4 [00:00<?, ?it/s, Generation Speed: 0.00 toks/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/code/rag_benchmark/vllm_benchmark/run_vllm.py", line 32, in <module>
[rank0]:     outputs = llm.generate(prompts, sampling_params)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/utils.py", line 672, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 300, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 552, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 773, in step
[rank0]:     output = self.model_executor.execute_model(
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 91, in execute_model
[rank0]:     output = self.driver_worker.execute_model(execute_model_req)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/worker/worker.py", line 272, in execute_model
[rank0]:     output = self.model_runner.execute_model(seq_group_metadata_list,
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 728, in execute_model
[rank0]:     hidden_states = model_executable(**execute_model_kwargs)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 330, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 254, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 206, in forward
[rank0]:     hidden_states = self.self_attn(
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 153, in forward
[rank0]:     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/root/tx8kdl2jzhi/envs/rag/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/code/github/MInference/minference/patch.py", line 1085, in vllm_attn_forward
[rank0]:     return self.impl.forward(
[rank0]:   File "/home/code/github/MInference/minference/modules/minference_forward.py", line 1189, in forward_vllm_043

[rank0]: NameError: name 'cache_ops' is not defined
Processed prompts:   0%| | 0/4 [00:00<?, ?it/s, Generation Speed: 0.00 toks/s]
```
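The failure is inside `forward_vllm_043`, where the name `cache_ops` is used without having been bound on the vLLM 0.4.3 code path. A version-guarded import of roughly this shape is the usual remedy; the module paths below are assumptions, and the actual fix (see #44 below) may differ:

```python
# Hedged sketch of a version-guarded import; the real fix landed in MInference
# #44 and may differ. Module paths are assumptions about where vLLM keeps the
# KV-cache kernels across releases.
try:
    # Older vLLM releases exposed the CUDA cache kernels directly
    from vllm._C import cache_ops
except ImportError:
    # vLLM 0.4.3 wraps them in a Python module instead
    from vllm import _custom_ops as cache_ops
```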

Additional Information

No response

1315577677 commented 1 month ago

I have the same issue. How can I fix it?

iofu728 commented 1 month ago

Hi @Zoro528 and @1315577677, thanks for your feedback. This is fixed in #44; you can upgrade MInference to pick up the fix:

```bash
pip install minference -U
```
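After upgrading, `pip show` can confirm the new version is active; the exact release carrying the fix isn't stated in this thread, so compare against the 0.1.4.post3 reported above:

```bash
pip show minference   # expect a version newer than 0.1.4.post3
```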