gxx777 opened this issue 11 months ago
Getting the same, going to try some of the smaller models from huggingface and see how it goes.
Isn't Llama-2-7b-chat already the smallest official Llama 2 one can get from Hugging Face?
Right now I am trying to run Stable Diffusion on an 8GB M2 Pro. Only god can help me now.
It appears that the mlx framework, particularly LLaMA and Stable Diffusion, demands significant memory and processing resources, making it hard to run efficiently on machines with 16GB of RAM such as the M1 Pro.
Memory consumption of around 13GB for LLaMA and 11GB for Stable Diffusion on your configuration leaves very little headroom on a 16GB system, which would explain the long processing times you are seeing.
Consider optimizing your workflow or exploring alternative frameworks that better fit your system's resources. Reaching out to the framework's developers or community for optimizations or configurations suited to your machine could also be worthwhile.
So for Llama and Mistral, 32GB is plenty and 24GB is probably also fine. I measured peak memory use at around 16GB, so a 16GB machine would be on the small side, and swapping likely explains why you are seeing such horrible perf. This is something we have quite a bit of runway to improve, though.
Since the model size accounts for most of the memory (7B params is about 13GB in half precision), quantization is probably the biggest lever at the moment, and we are prioritizing it accordingly. It will roughly halve memory use with 8-bit quantization and cut it to a quarter with 4-bit... so in the near future a 16GB machine should be very practical.
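To make the arithmetic concrete, here is a quick back-of-the-envelope estimate. It is only a sketch: it counts the weights alone, while the measured ~16GB peak also includes the KV cache, activations, and framework overhead.

```python
# Weight-only memory for a 7B-parameter model at different precisions.
PARAMS = 7_000_000_000          # ~7B parameters
GIB = 1024 ** 3                 # bytes per GiB

for name, bytes_per_param in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    print(f"{name:>6}: ~{PARAMS * bytes_per_param / GIB:.1f} GiB of weights")

# Prints roughly:
#   fp16: ~13.0 GiB
#  8-bit: ~6.5 GiB
#  4-bit: ~3.3 GiB
```

So with 4-bit weights the model itself drops to roughly 3-4GB, which is why a 16GB machine should become comfortable once quantization lands.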
Why not make 8GB machines work too, with something like --lowvram / --medvram / --lowram? Since this is Metal and it uses a unified memory pool, the same 8GB serves as both RAM and VRAM at once, right?? Obviously generation will be slower, but maybe swap memory could make it workable.
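On the low-memory-mode idea: MLX does not have --lowvram-style flags, but newer releases expose knobs for bounding how much unified memory it uses. A minimal sketch, assuming your MLX version provides `mx.metal.set_cache_limit` and `mx.metal.set_memory_limit` (treat the exact names, signatures, and availability as an assumption and check your installed version):

```python
import mlx.core as mx

GIB = 1024 ** 3

# Shrink the buffer cache so memory from freed arrays is returned
# to the system sooner instead of being kept around for reuse.
mx.metal.set_cache_limit(512 * 1024 * 1024)   # 512 MiB cache

# Ask MLX to keep allocations under ~6 GiB on an 8GB machine.
# With relaxed=True it may still exceed the limit (and swap)
# rather than raising an error when it needs more memory.
mx.metal.set_memory_limit(6 * GIB, relaxed=True)
```

Even with these caps, an 8GB machine will spend most of its time swapping with a 7B model until quantized weights are available.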
Can't wait for this to be implemented. Will inference be faster than llama.cpp?
You just made me realize that the people doing the mlx-lm fine-tuning demos are using an M3 Max, which is drastically different from my humble 16GB Mac...
Hello, can you provide a minimum configuration for model usage?
macOS 13.4.1 14-inch M1 Pro 16GB
The memory consumption for LLaMA reaches around 13GB.
For Stable Diffusion, memory consumption reaches around 11GB and generation takes more than ten minutes.
Unfortunately, given these observations, it seems that the mlx framework is almost unusable on a 16GB M1 Pro machine.