ml-explore / mlx

MLX: An array framework for Apple silicon
https://ml-explore.github.io/mlx/
MIT License

Quantisation Support #13

Closed okpatil4u closed 4 months ago

okpatil4u commented 5 months ago

Hello there, great work!

I was checking whether models with int8, int5, or int4 quantization formats can be used with this package. Could you please create an example if possible?

tanliboy commented 5 months ago

+1 The inference speed is very impressive. Amazing work! It would be awesome to run quantized models with it.

robertritz commented 5 months ago

Yes! If quantized models are supported I will be moving away from llama.cpp and over to MLX. Prompt evaluation is still tediously slow with llama.cpp on M-series Macs, and it seems that MLX will speed this up greatly.

tanliboy commented 5 months ago

> it seems that MLX will speed this up greatly.

I am not sure about this and am looking for more data points. I tested llama.cpp and MLX with the 16-bit Llama 7B chat model on an M1, and they had similar performance.

robertritz commented 5 months ago

I don't think it will improve overall tokens per second, but time to first token is the issue right now.

For large prompts it can take almost 10 seconds to get a single token of output.
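
(For concreteness, a minimal sketch of how one could measure time to first token with llama-cpp-python's streaming API; the model path and prompt are placeholders, not taken from this thread.)

```python
# Hedged sketch: measure time-to-first-token vs. overall generation speed with
# llama-cpp-python. Model path and prompt are placeholders for illustration.
import time
from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

prompt = "Summarize the history of Apple silicon in three sentences."
start = time.perf_counter()

first_token_at = None
n_chunks = 0
# With stream=True the call yields completion chunks, roughly one per token.
for chunk in llm(prompt, max_tokens=128, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()
    n_chunks += 1

total = time.perf_counter() - start
print(f"time to first token: {first_token_at - start:.2f}s")
print(f"overall: {n_chunks / total:.1f} tokens/s")
```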

okpatil4u commented 5 months ago

Are you sure? For me, prompt evaluation was almost instantaneous. I used Joe Biden's entire Wikipedia intro as the prompt, and it generated the early-life section in no time.

robertritz commented 5 months ago

To be clear, I'm referring to llama.cpp (I'm using the Python bindings). For long prompts, it takes several seconds for me to get a response. Were you referring to llama.cpp?

okpatil4u commented 5 months ago

Aah, I was saying that MLX is faster. Yes, llama.cpp is very slow at prompt evaluation.

robertritz commented 5 months ago

Yep, that's why I'm so interested in MLX. Slow prompt evaluation is the main shortcoming of local LLMs on Macs right now.

tanliboy commented 5 months ago

That is pretty interesting. @robertritz, do you have any data points on the evaluation latency you are seeing? Is your prompt templated? If so, have you tried tokenizing parts of it ahead of time or using the /infill mode? What kinds of quantization algorithms are you using or looking for? Yes, I am thinking of creating an open-source library on top of MLX for this, if it is not on the MLX team's roadmap yet.
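
(For the templated-prompt case, a hedged sketch of one related mitigation: keep the template prefix identical between calls and enable llama-cpp-python's prompt cache so the shared prefix does not have to be re-evaluated each time. The model path, template, and questions below are placeholders.)

```python
# Hedged sketch, assuming llama-cpp-python's LlamaCache: reuse the evaluated
# state for the shared template prefix across calls. Paths and prompts are
# placeholders, not from this thread.
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)
llm.set_cache(LlamaCache())  # the longest matching token prefix is reused

TEMPLATE = "You are a helpful assistant.\n\nUser: {question}\nAssistant:"

for question in ["What is MLX?", "What does 4-bit quantization change?"]:
    out = llm(TEMPLATE.format(question=question), max_tokens=64)
    print(out["choices"][0]["text"].strip())
```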

robertritz commented 5 months ago

@tanliboy I haven't done any testing that was recorded, but here is my anecdotal experience. When a model is first loaded with llama.cpp (not kept in RAM but lazily loaded), the first message takes at least 1-2 seconds even for short prompts. Follow-up questions are quicker.

I'm not sure why that is in terms of llama.cpp's internals, but it's what I have experienced. The longer the initial prompt, the longer the delay.

I'm mostly using 4-bit quantizations of 7B and 13B models locally on M-series Macs. I would also love to have support for BERT models, so Sentence Transformer models could work efficiently without needing the entire torch library installed (300+ MB).

But I also realize this is no easy task.

vade commented 5 months ago

WRT the Llama.cpp startup: it could be initial CoreML model compilation. GGML (the Llama.cpp runtime) can leverage the ANE, and I think (?) via that the tokenizer can run on CoreML, just like Whisper, and it's enabled by default. Other paths run on Metal / CPU via GGML tensor ops by default, IIRC, since doing key-value caching in CoreML isn't easily doable.

Whisper.cpp can have the same issue, depending on how its compile flags are set and which model is being used internally.

Sorry if that's off topic.

MikeyBeez commented 4 months ago

You can use quantized models with Ollama, but the quality is poor. https://www.youtube.com/watch?v=7BH4C6-HP14

angeloskath commented 4 months ago

Closing, as we now support quantized models and it is as easy as nn.QuantizedLinear.quantize_module(model). There are also a lot of quantized models in http://github.com/ml-explore/mlx-examples/.
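
For reference, a minimal sketch of that call on a toy module (the MLP definition, sizes, and input are illustrative assumptions, not from this issue):

```python
# Minimal sketch of quantizing an MLX model with the call mentioned above.
# The toy MLP and the input shape are assumptions for illustration only.
import mlx.core as mx
import mlx.nn as nn


class MLP(nn.Module):
    def __init__(self, dims: int = 512):
        super().__init__()
        self.fc1 = nn.Linear(dims, dims)
        self.fc2 = nn.Linear(dims, dims)

    def __call__(self, x):
        return self.fc2(nn.relu(self.fc1(x)))


model = MLP()

# Replace every nn.Linear in the module tree with a quantized equivalent.
# (Group size and bit width can reportedly be passed as arguments; the
# defaults are used here.)
nn.QuantizedLinear.quantize_module(model)

x = mx.random.normal((1, 512))
y = model(x)
mx.eval(y)  # MLX is lazy; force evaluation
print(y.shape)
```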