ml-explore / mlx-examples

Examples in the MLX framework

Performance with M1 Pro 16GB: Is it Normal? #38

Open gxx777 opened 9 months ago

gxx777 commented 9 months ago

Hello, can you provide a minimum configuration for running the models?

macOS 13.4.1 14-inch M1 Pro 16GB

  1. LLaMA. Chat takes so long that it is effectively unusable.
    (mlx) llama % python3 llama.py Llama-2-7b-chat.npz tokenizer.model "who are you?"
    [INFO] Loading model from disk.
    Press enter to start generation
    ------

    The memory consumption reaches around 13GB.

  2. Stable Diffusion
    (mlx) stable_diffusion % python3 txt2image.py "a beautiful flower" --output flower.png
    2%|█▊                                                                                       | 1/50 [00:20<16:55, 20.72s/it]

    The memory consumption reaches around 11GB, and it takes more than ten minutes.

Unfortunately, given these observations, it seems that the mlx framework is barely usable on a 16GB M1 Pro.
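
For anyone who wants to reproduce these measurements, a minimal sketch along these lines (not from the repo; it assumes `psutil` is installed and only approximates what Activity Monitor reports) can record the peak resident memory of one of the example scripts:

```python
# Record the peak resident memory of an example script while it runs.
# Assumes `pip install psutil`; adjust the command to the script you are testing.
import time

import psutil

cmd = ["python3", "txt2image.py", "a beautiful flower", "--output", "flower.png"]
proc = psutil.Popen(cmd)

peak_rss = 0
while proc.poll() is None:
    try:
        rss = proc.memory_info().rss
        # Include any worker processes the script may spawn.
        rss += sum(c.memory_info().rss for c in proc.children(recursive=True))
    except psutil.NoSuchProcess:
        break
    peak_rss = max(peak_rss, rss)
    time.sleep(0.5)

print(f"peak RSS ≈ {peak_rss / 1e9:.1f} GB")
```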

Digimonger commented 9 months ago

Getting the same, going to try some of the smaller models from huggingface and see how it goes.

bbelescot commented 9 months ago

> Getting the same, going to try some of the smaller models from huggingface and see how it goes.

Isn't Llama-2-7b-chat already the smallest official Llama 2 one can get from Hugging Face?

saminatorkash commented 9 months ago

Right now I am trying to run Stable Diffusion on an 8GB M2 Pro. Only god can help me now.

alikhan-tech commented 9 months ago

It appears that the mlx examples, particularly LLaMA and Stable Diffusion, demand significant memory and compute, making them hard to run efficiently on 16GB machines such as the M1 Pro.

The roughly 13GB consumed by LLaMA and 11GB by Stable Diffusion on your configuration leaves very little headroom on a 16GB machine, so the system likely falls back on swapping, which would explain the long run times.

Consider optimizing your workflow or exploring alternative frameworks that better fit your system's resources. Reaching out to the framework's developers or community for optimization tips or configurations tailored to your machine could also help.

awni commented 9 months ago

So for Llama and Mistral, 32GB is plenty and 24GB is probably also fine. I measured the peak memory use at around 16GB, so a 16GB machine would be on the small side, and swapping likely explains why you are seeing such horrible perf. This is something we have quite a bit of runway to improve, though.

Since the model size accounts for most of the memory (7B params is about 13GB in half-precision), quantization is probably the biggest lever at the moment, and we are prioritizing it accordingly. We will basically cut memory use in half with 8-bit quantization and to a quarter with 4-bit ... so in the near future a 16GB machine should be very practical.
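
To make the arithmetic concrete, a quick back-of-the-envelope sketch for the weights alone (activations, the KV cache, and framework overhead come on top of this):

```python
# Memory needed just to hold 7B parameters at different bit widths.
params = 7e9  # Llama-2-7B

for bits in (16, 8, 4):
    gib = params * bits / 8 / 2**30
    print(f"{bits:2d}-bit weights: ~{gib:.1f} GiB")

# 16-bit weights: ~13.0 GiB
#  8-bit weights: ~6.5 GiB
#  4-bit weights: ~3.3 GiB
```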

saminatorkash commented 9 months ago

> So for Llama and Mistral, 32GB is plenty and 24GB is probably also fine. I measured the peak memory use at around 16GB, so a 16GB machine would be on the small side, and swapping likely explains why you are seeing such horrible perf. This is something we have quite a bit of runway to improve, though.
>
> Since the model size accounts for most of the memory (7B params is about 13GB in half-precision), quantization is probably the biggest lever at the moment, and we are prioritizing it accordingly. We will basically cut memory use in half with 8-bit quantization and to a quarter with 4-bit ... so in the near future a 16GB machine should be very practical.

Why not make 8GB machines work too, with options like --lowvram, --medvram, and --lowram? Since this is Metal and it uses a unified memory pool, the 8GB effectively serves as both RAM and VRAM at the same time, right? Obviously generation speed would take a hit, but maybe swap could make it workable.

x4080 commented 9 months ago

> So for Llama and Mistral, 32GB is plenty and 24GB is probably also fine. I measured the peak memory use at around 16GB, so a 16GB machine would be on the small side, and swapping likely explains why you are seeing such horrible perf. This is something we have quite a bit of runway to improve, though.
>
> Since the model size accounts for most of the memory (7B params is about 13GB in half-precision), quantization is probably the biggest lever at the moment, and we are prioritizing it accordingly. We will basically cut memory use in half with 8-bit quantization and to a quarter with 4-bit ... so in the near future a 16GB machine should be very practical.

Can't wait for this to be implemented. Will it run inference faster than llama.cpp?
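
In the meantime, a rough way to compare end-to-end wall-clock time against llama.cpp is a sketch like the one below; the commands, the GGUF file name, and the paths are placeholders, not canonical invocations, and will need adjusting to your local setup:

```python
# Rough wall-clock comparison; commands/paths below are placeholders.
import subprocess
import time

def timed(cmd, stdin=None):
    start = time.perf_counter()
    subprocess.run(cmd, input=stdin, check=True)
    return time.perf_counter() - start

# mlx-examples llama script (it waits for Enter, so feed it a newline).
mlx_cmd = ["python3", "llama.py", "Llama-2-7b-chat.npz", "tokenizer.model", "who are you?"]
# llama.cpp's `main` binary with a hypothetical GGUF file name.
cpp_cmd = ["./main", "-m", "llama-2-7b-chat.Q8_0.gguf", "-p", "who are you?", "-n", "128"]

mlx_seconds = timed(mlx_cmd, stdin=b"\n")
cpp_seconds = timed(cpp_cmd)
print(f"mlx:       {mlx_seconds:.1f}s")
print(f"llama.cpp: {cpp_seconds:.1f}s")
```

Note that llama.cpp prints its own tokens-per-second summary at the end of a run, which is usually a fairer comparison than total wall time since it separates model loading from generation.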

fangyuan-ksgk commented 8 months ago

You just made me realise that the guys doing the mlx-lm fine-tuning demos are using an M3 Max, which is drastically different from my humble 16GB Mac...