matt-c1 / llama-cpp-speed-measurements

Performance measurements of llama.cpp and exllamav2 on my machine.
Creative Commons Zero v1.0 Universal

Outdated EXL2 - oobabooga's Text Generation WebUI #1

Closed by Vhallo 5 months ago

Vhallo commented 5 months ago

Just so you know, oobabooga's Text Generation WebUI currently still ships a quite outdated exl2 version (0.0.20, versus the latest 0.1.5).

He's in the process of updating it though.

matt-c1 commented 5 months ago

Thank you! I updated to exllamav2-0.1.5+cu121.torch2.3.1 and it did give a significant speedup for long prompts without Flash Attention, but I did not see a speedup with Flash Attention enabled. So it does not really change any conclusions: it's still faster with FA, and I'm still not aware of any downside to FA.
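
For anyone reproducing this, a minimal sketch of how one might confirm which exllamav2 build is actually installed and whether Flash Attention is available (assuming both were installed as regular pip packages; the package names below are the usual PyPI distribution names, not anything specific to this repo):

```python
# Quick sanity check: report the installed versions of exllamav2, flash-attn,
# and torch, and whether the flash_attn module can actually be imported.
from importlib.metadata import PackageNotFoundError, version
from importlib.util import find_spec

for pkg in ("exllamav2", "flash-attn", "torch"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")

# Flash Attention is only usable if the module resolves on this system.
print("flash_attn importable:", find_spec("flash_attn") is not None)
```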

I updated the .csv results and the plots. GitHub caches images, so it will take ~10 minutes for the article to update. You can see the data changes in the commit, and the new plots by clicking on them directly.