Unexpected RAM consumption during/after model load when using full GPU offload
Unexpected Behavior In Detail
I just upgraded LM Studio to 0.2.22 and got it running with a Tesla M40, which has 12GB of VRAM. The previously downloaded LLAMA3-8B-Q4_K_M (~4GB) should load entirely into VRAM without any issue, and it seems to: with GPU offload set to max, VRAM usage is normal (~5GB) and token generation is much faster, which means it should be running properly on the GPU. So I'm curious why it is still consuming a HUGE amount of RAM (over 4GB of physical RAM and 10-15GB of swap/virtual memory). It got even worse after enabling Flash Attention: the app ran out of memory and crashed (8GB physical and 17GB swap) even though there was still plenty of free VRAM on the GPU side.
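My working theory, and it is only an assumption since I have not read LM Studio's source, is that the bundled llama.cpp backend memory-maps the GGUF file by default, so the mapped file can be counted against the process's memory footprint even when every layer lives in VRAM. Below is a minimal diagnostic sketch (using llama-cpp-python and psutil, not LM Studio's actual code; the model path is a hypothetical example) that measures how much resident memory a full-offload load adds with mmap on or off:

```python
# Minimal diagnostic sketch, NOT LM Studio's code. Assumes llama-cpp-python
# and psutil are installed; the model path below is a hypothetical example.
import os
import psutil
from llama_cpp import Llama

MODEL_PATH = "Meta-Llama-3-8B.Q4_K_M.gguf"  # hypothetical path to the GGUF
USE_MMAP = True                             # flip to False on a second run

proc = psutil.Process(os.getpid())
rss_before = proc.memory_info().rss

# n_gpu_layers=-1 asks llama.cpp to offload all layers to the GPU,
# i.e. the equivalent of setting GPU offload to max in LM Studio.
llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, use_mmap=USE_MMAP)

rss_after = proc.memory_info().rss
print(f"use_mmap={USE_MMAP}: RSS grew by {(rss_after - rss_before) / 1024**3:.2f} GB")
```

If the RSS growth roughly matches the GGUF size only when use_mmap=True, the extra RAM would likely just be the mapped file being charged to the process rather than a genuine leak; the Flash Attention crash would then be a separate problem.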
Logs & Screenshots Attached
Expected Behavior
Normal RAM & VRAM usage (RAM: 2-3GB, the same as when idle; VRAM: depends on model size)
Screenshot-Related Info / Chat Log
- As you can see, I attached 4 screenshots: two of them were taken before loading the model, and the other two after offloading the entire model to the GPU.
- P.S. FYI, the preset was the same as the built-in Llama 3 preset, except that I switched the CPU threads value from 4 to 128, because a few weeks ago I found it faster when I had to run the model on the CPU only. The result is the same when switching back to the default value/preset, in case you have questions about that : )
And this is what happens when reloading the model after enabling Flash Attention:
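For completeness, here is a rough way to reproduce that Flash Attention reload outside LM Studio, again via llama-cpp-python. The flash_attn flag exists in recent versions of that binding; whether LM Studio 0.2.22 wires it up the same way is an assumption on my part:

```python
# Sketch of the Flash Attention reload, again NOT LM Studio's code.
# Assumes a recent llama-cpp-python that exposes the flash_attn flag.
from llama_cpp import Llama

MODEL_PATH = "Meta-Llama-3-8B.Q4_K_M.gguf"  # hypothetical path

# First load: full GPU offload, Flash Attention off.
llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, flash_attn=False)
del llm  # unload, mirroring LM Studio's model reload

# Second load: same settings but Flash Attention on. In my case this is
# where RAM usage explodes (8GB physical + 17GB swap) despite free VRAM.
llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, flash_attn=True)
```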
May LM Studio continue to flourish and prosper.
Best regards,
PD-Kerman
Version
0.2.22
Attachments
Log-1 Diagnostic Info
Log-2
main.log
Screenshots Combination