Is `mlx>=0.0.11` pushed to `pip` yet?

ibehnam commented 10 months ago

When trying the gguf example, I get:

ERROR: Could not find a version that satisfies the requirement mlx>=0.0.11 (from versions: 0.0.4.dev20231210, 0.0.4, 0.0.5.dev20231217, 0.0.5, 0.0.6.dev20231224, 0.0.6.dev20231231, 0.0.6, 0.0.7.dev202417, 0.0.7, 0.0.9.dev2024114, 0.0.9, 0.0.10.dev20240121, 0.0.10)
ERROR: No matching distribution found for mlx>=0.0.11

awni commented 10 months ago

Yes that's intentional. You have to build mlx from source to use that example until we release 0.0.11 (probably 1-2 days if you can wait 😄 )

ibehnam commented 10 months ago

@awni Okay, I built it from the source now and noticed that the speed (compared to llama.cpp) is less:

mlx (mistral Q4)

Prompt: 9.820 tokens-per-sec

Generation: 7.829 tokens-per-sec

llama.cpp (mistral Q8)

Prompt: 14.28 tokens-per-sec

Generation: 19.70 tokens-per-sec

(M1 Pro chip, Sonoma)

awni commented 10 months ago

Thanks for the benchmark.

Some comments:

I would not read too much into the prompt processing time as it's not a very good benchmark in our code (includes a bunch of warmup etc). If you run it twice it would be a lot faster (maybe 10x) the second time
Generation with LLMs is typically slower in MLX than llama.cpp. We are aware and also have a pretty good handle on why. The biggest reason is we haven't Implemented custom kernels for every op (RoPE, Layer Norm, RMS Norm, etc) and we are eating a lot of overhead cost there
MLX is designed to be a general framework that is fast for most numerical computing (not just LLMs), so there is some tension there between making it too bespoke and fast for LLMs (e.g. like llama.cpp) vs keeping it simple, hackable, and flexible
That said, we still aim to improve generation speed with a combination of general solutions (kernel fusion, compilation, etc) and possibly custom ops

ibehnam commented 10 months ago

Thanks @awni for clarifying. I like MLX and I hope the improvements you mentioned make it more attractive for devs. Maybe if MLX had something like llama.cpp/server, the warmup time wouldn't matter for the users. llama.cpp is trying to implement Flash Attention so it would get even faster, but I think MLX can make its own improvements esp. as a general ML framework for Apple silicon.

mzbac commented 10 months ago

@awni Not sure if mlx-lm wants to integrate with server functionality, but I feel it can be useful for people who want a quick taste of mlx. I have an example on how to run an OpenAI-like API using mlx-lm. The implementation is straightforward. Maybe we can add some of those community examples in the readme so that people can try them out without having to download and build the mlx-example themselves.

awni commented 10 months ago

Not sure if mlx-lm wants to integrate with server functionality, but I feel it can be useful for people who want a quick taste of mlx. I have an example on how to run an OpenAI-like API using mlx-lm

That's super cool! I'm not opposed to including it in mlx-lm. It could be a convenient way to show how to load a model persistently. What do you think, does it make sense in mlx-lm?

Maybe we can add some of those community examples in the readme so that people can try them out without having to download and build the mlx-example themselves.

Do you mean point to the community examples in the mlx-lm README?

mzbac commented 10 months ago

Yeah, personally, I think it makes sense. Since the mlx-lm is just a package we want to give people to try out mlx and remove barriers of entry, providing a built-in API would help them run llm via mlx-lm locally and integrate it with their workflow. For example, I am always running a llama.cpp server on my laptop using automator integration for quick grammar correction.

Do you mean point to the community examples in the mlx-lm README? I mean the cool project you repost on Twitter can be included in the mlx-example readme, so people will be aware of those cool tools that the community has built on top of mlx and they can try them out.

awni commented 10 months ago

Yeah, personally, I think it makes sense

Cool, I would be happy to include something like that. Would like to keep it pretty lightweight if possible though.

We could add a CLI like python -m mlx_lm.server which provides essentially the generate API via HTTP. Are you interested in working on that?

I mean the cool project you repost on Twitter can be included in the mlx-example readme,

💯 got it, I like that idea.

mzbac commented 10 months ago

Yeah, personally, I think it makes sense

Cool, I would be happy to include something like that. Would like to keep it pretty lightweight if possible though.

We could add a CLI like python -m mlx_lm.server which provides essentially the generate API via HTTP. Are you interested in working on that?

I mean the cool project you repost on Twitter can be included in the mlx-example readme,

💯 got it, I like that idea.

Yeah, sure thing. I'm more than happy to work on that. :)

ml-explore / mlx-examples

Is `mlx>=0.0.11` pushed to `pip` yet? #368