ibehnam closed this issue 10 months ago.
Yes, that's intentional. You have to build mlx from source to use that example until we release 0.0.11 (probably 1-2 days, if you can wait 😄).
@awni Okay, I built it from source now and noticed that the speed is lower compared to llama.cpp:
MLX:
Prompt: 9.820 tokens-per-sec
Generation: 7.829 tokens-per-sec

llama.cpp:
Prompt: 14.28 tokens-per-sec
Generation: 19.70 tokens-per-sec
(M1 Pro chip, Sonoma)
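(For context: the Prompt/Generation figures above look like the output of mlx-lm's `generate(..., verbose=True)`. Below is a minimal sketch of how such a tokens-per-second number can be measured by hand with mlx-lm's Python API; the model repo is only an example, and the exact numbers depend on quantization and prompt length.)

```python
# Rough sketch of a tokens-per-second measurement with mlx-lm.
# The model repo below is an example; substitute whatever model you benchmarked.
import time

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-v0.1-hf-4bit-mlx")  # example model

prompt = "Write a short story about a robot learning to paint."

start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# Count generated tokens by re-tokenizing the output text (approximate),
# and note that the elapsed time here includes prompt processing.
n_tokens = len(tokenizer.encode(text))
print(f"Generated {n_tokens} tokens in {elapsed:.2f}s "
      f"({n_tokens / elapsed:.2f} tokens-per-sec, prompt time included)")
```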
Thanks for the benchmark.
Some comments:
Thanks @awni for clarifying. I like MLX and I hope the improvements you mentioned make it more attractive for devs. Maybe if MLX had something like llama.cpp/server, the warmup time wouldn't matter to users. llama.cpp is also working on Flash Attention, so it will get even faster, but I think MLX can make its own improvements, especially as a general ML framework for Apple silicon.
@awni Not sure if mlx-lm wants to integrate server functionality, but I feel it could be useful for people who want a quick taste of mlx. I have an example of how to run an OpenAI-like API using mlx-lm; the implementation is straightforward. Maybe we could add some of these community examples to the README so that people can try them out without having to download and build mlx-examples themselves.
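(The example referenced above is external; purely as an illustration, here is a minimal sketch of what an OpenAI-style completions endpoint on top of mlx-lm could look like using only the standard library. The model repo, port, and route are placeholder assumptions, not the referenced implementation.)

```python
# Illustrative sketch of an OpenAI-like completions endpoint on top of mlx-lm.
# The model repo, port, and route are placeholders, not a published API.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from mlx_lm import load, generate

# Load once at startup so the model stays resident between requests.
MODEL, TOKENIZER = load("mlx-community/Mistral-7B-Instruct-v0.2-4bit")  # example model


class CompletionsHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/completions":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")

        # Run generation with the already-loaded model.
        text = generate(
            MODEL,
            TOKENIZER,
            prompt=body.get("prompt", ""),
            max_tokens=int(body.get("max_tokens", 128)),
        )

        payload = json.dumps({"choices": [{"text": text}]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), CompletionsHandler).serve_forever()
```

Loading the model once at startup keeps it resident between requests, which turns the warmup cost mentioned earlier into a one-time price.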
> Not sure if mlx-lm wants to integrate server functionality, but I feel it could be useful for people who want a quick taste of mlx. I have an example of how to run an OpenAI-like API using mlx-lm.

That's super cool! I'm not opposed to including it in `mlx-lm`. It could be a convenient way to show how to load a model persistently. What do you think, does it make sense in `mlx-lm`?
> Maybe we could add some of these community examples to the README so that people can try them out without having to download and build mlx-examples themselves.

Do you mean point to the community examples in the `mlx-lm` README?
Yeah, personally, I think it makes sense. Since mlx-lm is just a package we want people to use to try out mlx and remove barriers to entry, providing a built-in API would help them run LLMs via mlx-lm locally and integrate it into their workflows. For example, I always run a llama.cpp server on my laptop with an Automator integration for quick grammar correction.
> Do you mean point to the community examples in the `mlx-lm` README?

I mean the cool projects you repost on Twitter could be included in the mlx-examples README, so people will be aware of the tools the community has built on top of mlx and can try them out.
> Yeah, personally, I think it makes sense.

Cool, I would be happy to include something like that, though I would like to keep it pretty lightweight if possible. We could add a CLI like `python -m mlx_lm.server` which provides essentially the generate API via HTTP. Are you interested in working on that?

> I mean the cool projects you repost on Twitter could be included in the mlx-examples README.

💯 Got it, I like that idea.
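(If a `python -m mlx_lm.server` CLI along these lines were added, a client call might look like the sketch below. The port, route, and request/response fields are assumptions carried over from the server sketch above, not a published API; the prompt just echoes the grammar-correction use case mentioned earlier.)

```python
# Hypothetical client call against a locally running mlx-lm HTTP server.
# The port, route, and request/response fields are assumptions for illustration.
import json
from urllib import request

req = request.Request(
    "http://127.0.0.1:8080/v1/completions",
    data=json.dumps(
        {"prompt": "Fix the grammar: he go to school.", "max_tokens": 64}
    ).encode(),
    headers={"Content-Type": "application/json"},
)
with request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])
```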
> We could add a CLI like `python -m mlx_lm.server` which provides essentially the generate API via HTTP. Are you interested in working on that?
Yeah, sure thing. I'm more than happy to work on that. :)
When trying the `gguf` example, I get: