Just looking at raw RAM used is not a great indicator, since our allocator holds memory in a cache even when it's not actively needed (yes, this can be an issue in some cases, but it should not make the model slow; in fact, often the opposite).
What you should look at is whether swap keeps growing and whether the runtime is grinding to a halt. If swap grows and grows and generation really slows down (e.g. beyond the linear slowdown you'd expect from the growing attention context), then you are swapping actively needed memory, and that is bad.
Is that happening for you or is it running pretty quick?
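To make that concrete, here is a minimal sketch of how to tell the allocator's idle cache apart from memory that is actively in use. It assumes the memory-introspection helpers under `mx.metal` in recent MLX releases; their names and location may differ between versions.

```python
# Minimal sketch: distinguish MLX's reusable allocator cache from memory that
# is actively backing live arrays. Assumes the mx.metal.* introspection
# helpers present in recent MLX releases; names/locations may vary by version.
import mlx.core as mx

def report_memory(tag: str) -> None:
    active = mx.metal.get_active_memory() / 1e9  # bytes held by live arrays
    cache = mx.metal.get_cache_memory() / 1e9    # bytes parked in the reusable cache
    peak = mx.metal.get_peak_memory() / 1e9      # high-water mark so far
    print(f"[{tag}] active={active:.2f} GB  cache={cache:.2f} GB  peak={peak:.2f} GB")

# Call report_memory() before, during, and after generation: if "active" stays
# bounded and only "cache" grows, the RAM number in Activity Monitor is mostly
# reusable cache rather than memory the model actually needs right now.
```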
I'm just surprised that the memory is eaten so fast and the swap suddenly hits 1 GB. Do you mean that after it finishes, memory goes back down to a normal level? If that's your question, then I think so.
So why is the context taking so much memory? I don't think that happened with llama.cpp. Is the context still stored in fp16 or even fp32?
Thanks for answering
This should be resolved in the latest MLX / MLX LM. Let us know if you still encounter memory use issues.
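If anyone still sees memory pressure on a 16 GB machine, one possible workaround is to bound or flush the allocator cache so idle buffers don't pile up. This sketch assumes the `mx.metal.set_cache_limit` / `mx.metal.clear_cache` helpers available in recent MLX versions; the exact API may differ in your release.

```python
# Hedged workaround sketch: bound the allocator cache so cached-but-unused
# buffers don't show up as RAM pressure on a small-memory machine.
# Assumes mx.metal.set_cache_limit / clear_cache exist in your MLX version.
import mlx.core as mx

mx.metal.set_cache_limit(2 * 1024**3)  # cap the reusable cache at ~2 GB
# ... run generation ...
mx.metal.clear_cache()                 # release cached buffers back to the OS
```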
Hi, thanks for the quantization support, now I can try Phi-2 in MLX.
But I found that when Phi-2 is generating (with a longer context), memory usage grows quickly (I'm using an M2 Pro with 16 GB), and I noticed it uses up to 1 GB of swap.
This is my prompt:
Is this expected?
Thanks
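For context, here is a minimal sketch of the kind of long-generation run that showed this growth, using `mlx_lm`. The model path and prompt are placeholders (not the exact checkpoint or prompt from this report), and the `generate` signature may differ slightly between `mlx_lm` versions.

```python
# Hypothetical reproduction sketch: load a quantized Phi-2 checkpoint with
# mlx_lm and generate a long completion while watching Activity Monitor's
# memory and swap columns. The model path and prompt are placeholders.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/phi-2")  # placeholder checkpoint name
text = generate(
    model,
    tokenizer,
    prompt="Write a detailed, multi-paragraph explanation of transformers.",
    max_tokens=1024,  # longer generations are where the growth was observed
    verbose=True,
)
```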