Thank you! I updated to exllamav2-0.1.5+cu121.torch2.3.1, and it did give a significant speedup for long prompts without Flash Attention, but I saw no speedup with Flash Attention enabled. So it doesn't really change any conclusions: it's still faster with FA, and I'm still not aware of any downside to FA.
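In case it's useful for anyone reproducing the comparison, here is a minimal sketch of how the two configurations can be toggled with the exllamav2 0.1.x Python API. This is not the actual benchmark code; the model path is a placeholder, and I'm assuming `no_flash_attn` on `ExLlamaV2Config` is the relevant switch in this version:

```python
# Minimal sketch (not the benchmark script): load an EXL2 model and optionally
# disable Flash Attention so prompt-processing speed can be compared.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/exl2-model"   # placeholder model directory
config.prepare()
config.no_flash_attn = True                # assumption: set to False (or omit) to use FA

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)                # load weights, splitting across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
generator.warmup()

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

# Timing a long prompt around this call is where the FA / no-FA difference shows up.
long_prompt = "Once upon a time, " * 500
output = generator.generate_simple(long_prompt, settings, 128)
print(output[-500:])
```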
I updated the .csv results and the plots. GitHub caches images, so it will take ~10 minutes for the article to update. You can see the data changes in the commit, and view the new plots directly by clicking on them.
Just so you know, oobabooga's Text Generation WebUI currently still uses a fairly outdated exl2 version (0.0.20 vs. the latest 0.1.5). He's in the process of updating it, though.