turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Something is wrong with flash attention #358

Closed · ParisNeo closed this 2 weeks ago

ParisNeo commented 4 months ago
    import flash_attn_2_cuda as flash_attn_cuda

    ImportError: DLL load failed while importing flash_attn_2_cuda

turboderp commented 4 months ago

ExLlama doesn't import flash_attn_2_cuda directly, so this error seems to be coming from inside the flash-attn library itself. I'd need more of the stack trace to say for sure.
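For context, a minimal sketch of the usual optional-import pattern this describes. The names here (e.g. has_flash_attn) are illustrative, not necessarily exllamav2's actual internals; the point is that flash_attn itself loads its compiled extension (flash_attn_2_cuda), so a broken DLL surfaces inside that import rather than in the calling library:

    # Illustrative sketch: treat flash-attn as an optional dependency.
    has_flash_attn = False
    try:
        import flash_attn  # flash_attn itself does the flash_attn_2_cuda import on load
        has_flash_attn = True
    except ImportError as e:
        # On Windows, "DLL load failed while importing flash_attn_2_cuda" lands here
        print(f"flash-attn unavailable, continuing without it: {e}")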

VldmrB commented 3 months ago

Does your flash-attn build match your PyTorch version?
I had the same error when I was using a pre-built flash-attn wheel on Windows, and had to rebuild my own after updating torch from 2.1 to 2.2.
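A quick diagnostic sketch along these lines: a prebuilt flash-attn wheel is compiled against a specific torch/CUDA/Python combination, so it helps to print what is actually installed and confirm it matches the wheel you downloaded:

    # Check whether the installed torch and flash-attn line up.
    import torch

    print("torch:", torch.__version__)               # e.g. 2.2.0+cu121
    print("CUDA (torch build):", torch.version.cuda)

    try:
        import flash_attn
        print("flash-attn:", flash_attn.__version__)
    except ImportError as e:
        print("flash-attn import failed:", e)        # version/ABI mismatches surface here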

ParisNeo commented 3 months ago

Hi. It works if I recompile everything on my PC. But the problem is that since I am integrating this into my lollms app, users complain about failed compilations. I don't want to force them to install Visual Studio (or bundle its installer with my tool) on Windows, or build-essential on Linux, which would make the install procedure more complex.

I really wish there were precompiled wheels they could just use. Since my project is 100% free and unsponsored, and I am not someone with big resources, I can't afford an automatic build system for every possible platform :(
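One way around forcing a compile, sketched here under the assumption that the installer is Python-driven (try_install_flash_attn is a hypothetical helper, not part of lollms or exllamav2): attempt the flash-attn install, and if the build fails on a machine without a compiler, continue without it rather than aborting, since exllamav2 can still run without flash-attn, just less efficiently:

    # Best-effort install: never let a missing compiler break the whole setup.
    import subprocess
    import sys

    def try_install_flash_attn() -> bool:
        """Attempt to install flash-attn; return False instead of raising if it fails."""
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "flash-attn"],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            print("flash-attn install failed (likely no compiler); continuing without it.")
            return False
        return True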

turboderp commented 2 weeks ago

The Tabby maintainer also has prebuilt flash-attn wheels for Windows, here.
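When picking one of those prebuilt wheels, the local environment has to match what the wheel was built for. A small sketch of the values to check before downloading (exact filename conventions vary by whoever builds the wheels):

    # Print the tags a prebuilt flash-attn wheel needs to match.
    import platform
    import sys

    import torch

    print("python tag:", f"cp{sys.version_info.major}{sys.version_info.minor}")  # e.g. cp311
    print("platform:", platform.system(), platform.machine())                    # e.g. Windows AMD64
    print("torch:", torch.__version__)                                           # e.g. 2.2.0+cu121
    print("CUDA (torch build):", torch.version.cuda)                             # e.g. 12.1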