lmstudio-ai / lms

👾 LM Studio CLI
https://lms.dev
MIT License

[High Priority Feature] Please add Support for 8-bit and 4-Bit Caching! #56

Closed Iory1998 closed 3 months ago

Iory1998 commented 3 months ago

Hello team,

LM Studio is using recent updates in llama.cpp, which already supports 4-bit and 8-bit KV cache, so I don't understand why LM Studio does not incorporate it yet. The benefits are tremendous, since it improves generation speed. It also helps with using a higher quantization level.

To give you an example, I run aya-23-35B-Q4_K_M.gguf in LM Studio at a speed of 4.5 t/s, because the maximum number of layers I can load on my GPU with 24 GB of VRAM is 30 layers, and Aya has 41 layers. In Oobabooga WebUI, with the 4-bit cache enabled, I can load all layers into my VRAM, and the speed jumps to 20.5 t/s. That's a significant increase in performance (nearly fivefold).
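For reference, llama.cpp itself already exposes this. A sketch of what it looks like there (flag names as of recent llama.cpp builds, so treat this as an illustration; a quantized V cache also requires flash attention to be enabled):

```bash
# Quantize both KV-cache tensors to 4-bit (q4_0). llama.cpp requires
# flash attention (-fa) for a quantized V cache; -ngl 99 offloads all layers.
./llama-cli -m aya-23-35B-Q4_K_M.gguf -c 8192 -ngl 99 -fa \
  --cache-type-k q4_0 --cache-type-v q4_0
```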

This should be your main priority, since you are actually pushing your customers to move to a different platform. Right now, I don't use LM Studio when I want to run a larger model, which is unfortunate since I am your biggest fan.

Please, solve this issue ASAP.

yagil commented 3 months ago

Noted @Iory1998, will be addressed.

yagil commented 3 months ago

This is now available in beta. Check out the #beta-releases-chat channel on Discord.

Iory1998 commented 3 months ago

Thank you very much. I have been testing the 0.3 Beta 1 for a few days now, and it does not have the caching feature.

yagil commented 3 months ago

> Thank you very much. I have been testing the 0.3 Beta 1 for a few days now, and it does not have the caching feature.

It is a parallel beta for the current release train. Available as of an hour ago.

Iory1998 commented 3 months ago

Thank you for your prompt response. Could I get a link here or by email, since I don't use Discord? On a different note, I already sent your team an email with some remarks about Beta 1, but I haven't heard back. The email subject is "Feedback on LMS v0.3b Beta".

Iory1998 commented 3 months ago

Never mind, I joined Discord just to test the 0.3.1 Beta 1.

GabeAl commented 2 months ago

K and V quants for the context are still not available. Rolling back to pre-0.3 to get them back.

The difference is usable vs. unusable for me on a 16 GB GPU with Llama 3.1 8B and Phi-medium. With the Q4 cache quants, the model fit and could look through the full context.

The new release takes 4 times the memory (and even with a smaller cache it still runs slower).
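To put rough numbers on that, here is my back-of-envelope math (assuming Llama 3.1 8B's published shape: 32 layers, 8 KV heads, head dim 128, full 128k context, and roughly 4.5 / 8.5 bits per element for q4_0 / q8_0):

```bash
# KV-cache elements = 2 (K and V) * layers * ctx * kv_heads * head_dim
#                   = 2 * 32 * 131072 * 8 * 128 ≈ 8.6e9 elements
# f16 : 8.6e9 * 2.0    bytes ≈ 16.0 GiB  -> alone exceeds a 16 GB card
# q8_0: 8.6e9 * 1.0625 bytes ≈  8.5 GiB
# q4_0: 8.6e9 * 0.5625 bytes ≈  4.5 GiB  -> fits alongside the weights
echo $(( 2 * 32 * 131072 * 8 * 128 * 2 / 2**30 ))  # f16 cache in GiB: 16
```

That lines up with the roughly 4x memory difference I'm seeing between f16 and the Q4 cache.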

My request is to bring back the ability for the user to adjust the K and V context quants for Flash attention.

GabeAl commented 2 months ago

Just saw this was closed. This should not have been closed, as the feature is not available on the latest release (as far as I can see?).

https://github.com/lmstudio-ai/lms/issues/70

Iory1998 commented 2 months ago

> Just saw this was closed. This should not have been closed, as the feature is not available on the latest release (as far as I can see?).
>
> #70

No, it was closed because the feature is being added. In version 0.3.2, the KV cache is set to FP8. I tested the beta, and you could set the KV cache to Q4 or Q8, but that has not been added to the official LM Studio yet.