Closed: Iory1998 closed this 3 months ago
Noted @Iory1998, this will be addressed.
This is now available in beta. Check out the #beta-releases-chat channel on discord
Thank you very much. I have been testing the 0.3 Beta 1 for a few days now, and it does not have the caching feature.
It is a parallel beta for the current release train. Available as of an hour ago
Thank you for your prompt response. Could I get a link here or by email, since I don't use Discord? On a different note, I already sent your team an email with some remarks about Beta 1 but haven't heard back. The email subject is "Feedback on LMS v0.3b Beta".
Never mind, I joined Discord just to test the 0.3.1 Beta 1.
K and V cache quants for the context are still not available. Rolling back to pre-0.3 to get them back.
The difference is usable vs. unusable for me on a 16 GB GPU with Llama 3.1 8B and Phi-medium: with the Q4 cache quants, the models fit and could attend over the full context.
The new release takes 4 times the memory (and even with a smaller cache it still runs slower).
My request is to bring back the ability for the user to adjust the K and V cache quants for Flash Attention.
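For anyone wondering why the difference is so large, here is a rough back-of-the-envelope sketch using the usual KV-cache size formula (2 × layers × KV heads × head dim × context × bytes per element). The Llama 3.1 8B numbers (32 layers, 8 KV heads, head dim 128) are the published architecture values; the per-element byte counts for the quantized cache types are approximations that include llama.cpp's per-block scales.

```python
# Rough KV-cache size estimate: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem.
# Llama 3.1 8B architecture: 32 layers, 8 KV heads (GQA), head_dim 128, 128K context.
# Bytes per element are approximate for the quantized cache types (per-block scales included).

BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 1.0625, "q4_0": 0.5625}

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int, ctx: int, cache_type: str) -> float:
    elems = 2 * layers * kv_heads * head_dim * ctx
    return elems * BYTES_PER_ELEM[cache_type] / 1024**3

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(layers=32, kv_heads=8, head_dim=128, ctx=131072, cache_type=cache_type)
    print(f"Llama 3.1 8B @ 128K context, {cache_type} cache: ~{size:.1f} GiB")

# Prints roughly 16 GiB (f16), 8.5 GiB (q8_0), 4.5 GiB (q4_0) -- which is why an f16-only
# cache no longer fits next to the model weights on a 16 GB GPU at full context.
```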
Just saw this was closed. This should not have been closed, as the feature is not available on the latest release (as far as I can see?).
No, it was closed because the feature is being added. In version 0.3.2, the KV cache is set to FP8. I tested the beta, where you could set the KV cache to Q4 or Q8, but that option has not been added to the official LM Studio release yet.
Hello team,
LM Studio uses recent builds of llama.cpp, which already supports 4-bit and 8-bit KV cache, so I don't understand why LM Studio does not expose it yet. The benefits are tremendous: it improves generation speed, and the VRAM it frees up also makes it possible to use a higher-quality model quantization.
To give you an example, I run aya-23-35B-Q4_K_M.gguf in LM Studio at 4.5 t/s because the maximum number of layers I can load on my GPU with 24 GB of VRAM is 30 layers; Aya has 41 layers. In Oobabooga WebUI, with the 4-bit cache enabled, I can load all layers into VRAM, and the speed jumps to 20.5 t/s. That's a significant increase in performance (nearly 5-fold).
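For reference, llama.cpp itself already exposes this through `--cache-type-k` / `--cache-type-v` together with `--flash-attn` (quantizing the V cache requires flash attention), so presumably LM Studio could surface the same setting. Below is a minimal sketch of launching `llama-server` with a 4-bit cache from Python; it assumes a llama.cpp build with `llama-server` on PATH, the model path and port are placeholders, and exact flag spellings may vary between llama.cpp versions.

```python
# Minimal sketch: launch llama.cpp's llama-server with a 4-bit quantized KV cache.
# Assumes llama-server is on PATH; model path and port below are placeholders.
import subprocess

cmd = [
    "llama-server",
    "-m", "models/aya-23-35B-Q4_K_M.gguf",  # placeholder path to the GGUF file
    "-c", "8192",                           # context length
    "-ngl", "99",                           # offload all layers to the GPU
    "--flash-attn",                         # V-cache quantization requires flash attention
    "--cache-type-k", "q4_0",               # quantize the K cache to 4-bit
    "--cache-type-v", "q4_0",               # quantize the V cache to 4-bit
    "--port", "8080",                       # placeholder port
]
subprocess.run(cmd, check=True)
```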
This should be your main priority, since you are effectively pushing your customers to move to a different platform. Right now, I don't use LM Studio when I want to run a larger model, which is unfortunate since I am your biggest fan.
Please solve this issue ASAP.