Thank you! I updated to exllamav2-0.1.5+cu121.torch2.3.1, and it did give a significant speedup for long prompts without Flash Attention, but I saw no speedup with Flash Attention enabled. So it doesn't really change any conclusions: it's still faster with FA, and I'm still not aware of any downside to FA.
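In case it's useful for anyone reproducing the comparison, here is a minimal sketch of how the two configurations can be toggled with the exllamav2 0.1.x Python API. This is not the actual benchmark code; the model path is a placeholder, and I'm assuming `no_flash_attn` on `ExLlamaV2Config` is the relevant switch in this version:

```python
# Minimal sketch (not the benchmark script): load an EXL2 model and optionally
# disable Flash Attention so prompt-processing speed can be compared.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/exl2-model"   # placeholder model directory
config.prepare()
config.no_flash_attn = True                # assumption: set to False (or omit) to use FA

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)                # load weights, splitting across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
generator.warmup()

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

# Timing a long prompt around this call is where the FA / no-FA difference shows up.
long_prompt = "Once upon a time, " * 500
output = generator.generate_simple(long_prompt, settings, 128)
print(output[-500:])
```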
I updated the .csv results and the plots. GitHub caches images, so it will take ~10 minutes for the article to update. You can see the data changes in the commit, and view the new plots directly by clicking on them.
Just so you know, oobabooga's Text Generation WebUI currently still uses a fairly outdated exl2 version (0.0.20 vs. the latest 0.1.5). He's in the process of updating it, though.