turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

flash attention does nothing #199

Closed: Tedy50 closed this issue 5 months ago

Tedy50 commented 10 months ago

I can't notice any difference between normal and flash attention. Memory usage is exactly the same, and performance also looks the same. Same results on Windows and WSL.

turboderp commented 10 months ago

flash-attn needs to be version 2.2.1 or higher before it's used. Assuming you've got a later version, though, the benefits mostly show up at longer sequence lengths. In any case, for token-by-token inference at batch size 1, Flash Attention has never offered a huge benefit over regular dot-product attention.
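As a quick sanity check, you can confirm which flash-attn version Python actually sees. This is just a minimal sketch of such a check, not exllamav2's own detection logic:

```python
# Minimal sketch: verify flash-attn is importable and report its version,
# so it can be compared against the 2.2.1 minimum mentioned above.
try:
    import flash_attn
    print("flash-attn version:", flash_attn.__version__)
except ImportError:
    print("flash-attn is not installed; regular dot-product attention will be used")
```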

ezra-ch commented 10 months ago

Is there a flag to enable Flash Attention 2?

Tedy50 commented 10 months ago

Yes, it is enabled. I tried checking and unchecking that checkbox in the text generation UI, and it also seems to be loaded properly, since it required a CUDA update.

I tested it with context lengths over 10,000 tokens, but memory usage is precisely the same. There is not the slightest difference in memory usage or generation speed.
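For what it's worth, here is a rough way to compare the two runs using PyTorch's allocator statistics rather than eyeballing the task manager. This is only a sketch; the generation step in the middle is whatever long-context run you normally do:

```python
import torch

def report_vram(tag: str) -> None:
    # Print PyTorch's view of GPU memory. A flash-attn benefit, if any, should
    # show up as a lower peak at long context lengths, not at short prompts.
    alloc = torch.cuda.memory_allocated() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"{tag}: allocated {alloc:.0f} MiB, peak {peak:.0f} MiB")

torch.cuda.reset_peak_memory_stats()
report_vram("before generation")
# ... run the 10k-token prompt here ...
report_vram("after generation")
```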

Rajmehta123 commented 9 months ago

Is it enabled by default, or does it need to be set when loading the model? @Tedy50 @turboderp

turboderp commented 9 months ago

You can disable it with no_flash_attn = True in the config or with -nfa on the command line. Otherwise it's used automatically if it's installed (i.e. if the flash_attn package is found).
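For reference, a minimal sketch of what the config route can look like in Python, assuming the usual ExLlamaV2Config loading pattern (the model path is a placeholder):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"   # placeholder path
config.prepare()
config.no_flash_attn = True           # force regular attention even if flash_attn is installed
model = ExLlamaV2(config)
model.load()
```

Leaving no_flash_attn at its default keeps the automatic behavior described above: flash-attn is used whenever the package is found and new enough.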