microsoft / MInference

To speed up long-context LLM inference, MInference computes attention with approximate, dynamic sparse methods, reducing pre-filling inference latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Question]: ModuleNotFoundError: No module named 'minference.cuda' #45

Open lai-serena opened 1 month ago

lai-serena commented 1 month ago

Describe the issue

I encountered some issues when using minference in Python. Running import minference fails with:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspace/MInference/minference/__init__.py", line 8, in <module>
    from .models_patch import MInference
  File "/workspace/MInference/minference/models_patch.py", line 7, in <module>
    from .patch import minference_patch, minference_patch_vllm, patch_hf
  File "/workspace/MInference/minference/patch.py", line 12, in <module>
    from .modules.minference_forward import (
  File "/workspace/MInference/minference/modules/minference_forward.py", line 20, in <module>
    from ..ops.pit_sparse_flash_attention_v2 import vertical_slash_sparse_attention
  File "/workspace/MInference/minference/ops/pit_sparse_flash_attention_v2.py", line 10, in <module>
    from ..cuda import convert_vertical_slash_indexes
ModuleNotFoundError: No module named 'minference.cuda'

Environment:
Python 3.10.14
minference 0.1.4.post3
triton 2.1.0
torch 2.3.0
CUDA 11.8
vllm 0.4.2+cu118
flash-attn 2.5.8

iofu728 commented 1 month ago

Hi @lai-serena, thanks for your feedback.

It looks like the compiled CUDA extension failed to build during installation. You can try reinstalling with a forced build:

pip uninstall minference -y
MINFERENCE_FORCE_BUILD=TRUE pip install minference --no-cache-dir

Or build from source:

git clone https://github.com/microsoft/MInference/
cd MInference
pip install -e .
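If either route succeeds, the module named in the traceback should be importable again. As a quick sanity check (a generic sketch; the helper name `module_available` is hypothetical, and `minference.cuda` is simply the module from the traceback above), you can probe for it with the standard library's `importlib`:

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if `name` resolves to an importable module on this system."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # find_spec raises this when a parent package in a dotted path
        # (e.g. "minference" in "minference.cuda") is itself missing.
        return False

# After a successful build, this should report True:
# print(module_available("minference.cuda"))
```

If it still reports the module as missing, the build logs from the `pip install` step are the next place to look.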