ml-explore / mlx-examples


Huge memory usage (even in 4bit) #182

Closed · x4080 closed this 3 weeks ago

x4080 commented 8 months ago

Hi, thanks for the quantization support. Now I can try phi2 in MLX.

But I found that as phi2 generates (with a longer context), memory usage grows quickly (I'm using an M2 Pro with 16 GB), and I realized it uses up to 1 GB of swap.

This is my command:

```
python phi2.py --model-path ./models/phi2 --max-tokens 1024 --temp 0.7 --prompt "Question : Is your name is phil
Answer:"
```

Is this expected?

Thanks

awni commented 8 months ago

Just looking at raw RAM usage is not a great indicator, since our allocator hogs memory in a cache even when it's not actively needed (yes, this can be an issue in some cases, but it should not make the model slow; in fact, often the opposite).
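
For anyone who wants to measure this, here is a minimal sketch that separates live memory from the allocator's cache (this assumes a recent MLX build exposing the `mx.metal` introspection helpers; names may vary across versions):

```python
import mlx.core as mx

# Run some work so the allocator has something cached.
a = mx.random.normal((4096, 4096))
mx.eval(a @ a)

# Active memory = buffers the computation is actually using;
# cache memory = freed buffers the allocator keeps for reuse.
# Raw RSS in Activity Monitor is roughly the sum of both.
print(f"active: {mx.metal.get_active_memory() / 1e9:.2f} GB")
print(f"cached: {mx.metal.get_cache_memory() / 1e9:.2f} GB")
```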

What you should look at is whether the swap keeps growing and whether the run time is screeching to a halt. If the swap grows and grows and generation really slows down (e.g., beyond the linear slowdown expected from the growing attention context), then you are swapping actively needed memory, and that is bad.

Is that happening for you, or is it running pretty quickly?
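
A quick way to check is to log per-token latency as the context grows; here is a sketch, with a hypothetical `next_token()` standing in for one decode step of phi2.py:

```python
import time

def next_token():
    # Hypothetical stand-in for one decode step of the real model.
    time.sleep(0.01)

max_tokens = 256
prev = time.perf_counter()
for step in range(1, max_tokens + 1):
    next_token()
    now = time.perf_counter()
    if step % 64 == 0:
        # Mild growth here is expected from attention over a longer
        # context; if these times explode, the process is likely
        # swapping actively needed memory.
        print(f"token {step}: {(now - prev) * 1000:.1f} ms")
    prev = now
```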

x4080 commented 8 months ago

I was just surprised that the memory gets eaten so fast and the swap suddenly hits ~1 GB. Do you mean whether it goes back down to a normal level after finishing? If that's your question, then I think so.

So why is the context taking so much memory? I don't think that happened with llama.cpp. Is the context (KV cache) still stored in fp16, or even fp32?
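
For scale, here is a back-of-the-envelope estimate of what an fp16 KV cache would cost for phi2 (assuming its published config of 32 layers and hidden size 2560; the real layout may differ):

```python
# Rough fp16 KV-cache size for phi2 at a 1024-token context.
# Assumed config: 32 layers, hidden size 2560, 2 bytes per fp16 value.
n_layers, hidden, seq_len, fp16_bytes = 32, 2560, 1024, 2
kv_bytes = 2 * n_layers * hidden * seq_len * fp16_bytes  # factor 2: K and V
print(f"{kv_bytes / 2**20:.0f} MiB")  # ~320 MiB
```

So the cache itself would only be a few hundred MB at this context length; much of the remaining resident memory would be the allocator's cache rather than live tensors.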

Thanks for answering

awni commented 3 weeks ago

This should be resolved in the latest MLX / MLX LM. Let us know if you still encounter memory use issues.
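
For anyone hitting this thread now, here is a minimal sketch of the current MLX LM flow, with an optional cap on the allocator cache (the model id and the limit are illustrative, and the helper names assume a recent mlx-lm / MLX release):

```python
import mlx.core as mx
from mlx_lm import load, generate

# Optional: cap the allocator's cache so freed buffers are returned
# to the OS sooner (trades some reuse speed for lower resident memory).
mx.metal.set_cache_limit(2 * 1024**3)  # 2 GB cap; assumed helper name

# Illustrative model id; substitute whatever 4-bit phi2 conversion you use.
model, tokenizer = load("mlx-community/phi-2")
print(generate(model, tokenizer,
               prompt="Question: Is your name phil?\nAnswer:",
               max_tokens=256))
```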