flash-attn needs to be version 2.2.1 or higher for it to be used. Assuming you've got a recent enough version, though, the benefits mostly show up at longer sequence lengths. In any case, for token-by-token inference at batch size 1, Flash Attention has never offered a huge benefit over regular dot-product attention.
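If you want to confirm which version is actually installed, a minimal check like the sketch below should work (this assumes the package was installed under the name flash-attn; importlib.metadata is just one way to query it):

```python
# Minimal sketch: check the installed flash-attn version against the 2.2.1 minimum.
from importlib.metadata import version, PackageNotFoundError

try:
    v = version("flash-attn")
    print(f"flash-attn {v} installed; 2.2.1 or higher is required")
except PackageNotFoundError:
    print("flash-attn is not installed; regular attention will be used")
```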
Is there a flag to enable Flash Attention 2?
Yes, it is enabled. I tried checking and unchecking that checkbox in the text generation UI, and it also seems to be loaded properly, since it required a CUDA update.
I tested it with context lengths over 10,000 tokens, but memory usage is precisely the same. There is not the slightest difference in memory usage or generation speed.
Is it enabled by default, or does it need to be set when loading the model? @Tedy50 @turboderp
You can disable it with `no_flash_attn = True` in the config or with `-nfa` on the command line. Otherwise it's used automatically if it's installed (i.e. if the `flash_attn` package is found).
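In a script, that would look roughly like the sketch below (the model path is a placeholder, and the surrounding loading code just follows the usual ExLlamaV2 pattern rather than anything specific to this issue):

```python
# Sketch: disable Flash Attention programmatically in an ExLlamaV2 config.
from exllamav2 import ExLlamaV2, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"   # placeholder path
config.prepare()
config.no_flash_attn = True           # skip flash-attn even if the package is installed

model = ExLlamaV2(config)
model.load()
```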
I can't notice any difference between normal and Flash Attention: memory usage is exactly the same, and performance also looks the same. Same results on Windows and WSL.
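One way to make the comparison concrete is to record peak VRAM around a single generation pass, once with and once without flash-attn. A rough sketch, assuming PyTorch on a single CUDA device (the generation call itself is elided since it depends on your setup):

```python
import torch

# Reset the peak-memory counter before the run you want to measure.
torch.cuda.reset_peak_memory_stats()

# ... run one generation pass over the long-context prompt here ...

# Report the peak allocation observed during that pass.
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM: {peak_gib:.2f} GiB")
```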