vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: vllm-flash-attn cu118 compatibility #5232

Open epark001 opened 4 weeks ago

epark001 commented 4 weeks ago

🚀 The feature, motivation and pitch

It looks like vllm-flash-attn currently does not support cu118:

>>> import vllm_flash_attn
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/pyenv/versions/3.10.9/lib/python3.10/site-packages/vllm_flash_attn/__init__.py", line 3, in <module>
    from vllm_flash_attn.flash_attn_interface import (
  File "/opt/pyenv/versions/3.10.9/lib/python3.10/site-packages/vllm_flash_attn/flash_attn_interface.py", line 10, in <module>
    import vllm_flash_attn_2_cuda as flash_attn_cuda
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory

The original flash-attn project supports cu118 and vLLM itself supports cu118, so a cu118 build of vllm-flash-attn would be helpful.
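
A quick way to confirm the mismatch locally (a minimal diagnostic sketch, not part of the original report) is to check which CUDA runtime sonames the environment can actually load; the published wheel links against libcudart.so.12, while CUDA 11.x installs typically ship libcudart.so.11.0:

import ctypes

# The published vllm-flash-attn wheel is linked against the CUDA 12 runtime
# (libcudart.so.12); on a CUDA 11.8-only machine that soname is missing.
for name in ("libcudart.so.12", "libcudart.so.11.0"):
    try:
        ctypes.CDLL(name)
        print(f"{name}: found")
    except OSError:
        print(f"{name}: not found")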

Alternatives

No response

Additional context

No response

thangld201 commented 3 weeks ago

Me too. Does anyone know what change is needed to bring back CUDA 11.8 support? Maybe I can test it locally first.

zhaotyer commented 3 weeks ago

Me too.

zhaotyer commented 3 weeks ago

I built a cu118 version from the https://github.com/vllm-project/flash-attention/tree/v2.5.8.post2 source code. You can download the wheel from https://github.com/zhaotyer/vllm_whl_repo/blob/master/vllm_flash_attn-2.5.8.post2-cp38-cp38-linux_x86_64.whl
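
For reference, a rough sketch of what such a build might look like when driven from Python (the tag is taken from the URL above; the CUDA_HOME path and the pip flags are assumptions, so adjust them for your environment):

import os
import subprocess
import sys

# Point the build at a CUDA 11.8 toolkit (path is an assumption; adjust to your install).
env = os.environ.copy()
env["CUDA_HOME"] = "/usr/local/cuda-11.8"

# Check out the same tag referenced above.
subprocess.run(
    ["git", "clone", "--branch", "v2.5.8.post2",
     "https://github.com/vllm-project/flash-attention.git"],
    check=True,
)

# Build a wheel against the local CUDA 11.8 toolchain; build dependencies
# (torch, ninja, packaging, ...) are assumed to already be installed.
subprocess.run(
    [sys.executable, "-m", "pip", "wheel", "--no-deps", "--no-build-isolation", "."],
    cwd="flash-attention",
    env=env,
    check=True,
)

Then pip install the resulting wheel in place of the cu12 one from PyPI.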

heianzhihuo commented 4 days ago

> I built a cu118 version from the https://github.com/vllm-project/flash-attention/tree/v2.5.8.post2 source code. You can download the wheel from https://github.com/zhaotyer/vllm_whl_repo/blob/master/vllm_flash_attn-2.5.8.post2-cp38-cp38-linux_x86_64.whl

Could you build a cu118 Python 3.9 version? Many thanks.

hasakikiki commented 2 days ago

@heianzhihuo have you got a whl with cu118 and Python 3.9? I'm looking for it too.

anlongfei commented 1 day ago

> I built a cu118 version from the https://github.com/vllm-project/flash-attention/tree/v2.5.8.post2 source code. You can download the wheel from https://github.com/zhaotyer/vllm_whl_repo/blob/master/vllm_flash_attn-2.5.8.post2-cp38-cp38-linux_x86_64.whl
>
> Could you build a cu118 Python 3.9 version? Many thanks.

I'm looking for it, too.