Unexpected RAM consumption during/after model load when using full GPU offload
Unexpected Behavior In Detail
I just upgraded LM Studio to 0.2.22 and got it running with a Tesla M40, which has 12GB of VRAM. The previously downloaded LLAMA3-8B-Q4_K_M (~4GB) should load entirely into VRAM without any issue, and it seems to: with GPU offload set to max, VRAM usage is normal (~5GB) and token generation is much faster, which means it should be running properly on the GPU. So I'm curious why it is still consuming a HUGE amount of RAM (over 4GB of physical RAM and 10-15GB of swap/virtual memory). It got even worse after enabling Flash Attention: the app ran out of memory and crashed (8GB physical and 17GB swap) even though there was still plenty of free VRAM on the GPU side.
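My working theory, and it is only an assumption since I have not read LM Studio's source, is that the bundled llama.cpp backend memory-maps the GGUF file by default, so the mapped file can be counted against the process's memory footprint even when every layer lives in VRAM. Below is a minimal diagnostic sketch (using llama-cpp-python and psutil, not LM Studio's actual code; the model path is a hypothetical example) that measures how much resident memory a full-offload load adds with mmap on or off:

```python
# Minimal diagnostic sketch, NOT LM Studio's code. Assumes llama-cpp-python
# and psutil are installed; the model path below is a hypothetical example.
import os
import psutil
from llama_cpp import Llama

MODEL_PATH = "Meta-Llama-3-8B.Q4_K_M.gguf"  # hypothetical path to the GGUF
USE_MMAP = True                             # flip to False on a second run

proc = psutil.Process(os.getpid())
rss_before = proc.memory_info().rss

# n_gpu_layers=-1 asks llama.cpp to offload all layers to the GPU,
# i.e. the equivalent of setting GPU offload to max in LM Studio.
llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, use_mmap=USE_MMAP)

rss_after = proc.memory_info().rss
print(f"use_mmap={USE_MMAP}: RSS grew by {(rss_after - rss_before) / 1024**3:.2f} GB")
```

If the RSS growth roughly matches the GGUF size only when use_mmap=True, the extra RAM would likely just be the mapped file being charged to the process rather than a genuine leak; the Flash Attention crash would then be a separate problem.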
Logs & Screenshots Attached
Expected Behavior
Normal RAM & VRAM usage (RAM: 2-3GB, the same as when idle; VRAM: depends on model size)
Screenshot-Related Info / Chat Log
- As you can see, I attached 4 screenshots: two of them were taken before loading the model, and the other two after offloading the entire model to the GPU.
- P.S. FYI, the preset was the same as the built-in Llama 3 preset, except that I switched the CPU threads value from 4 to 128, because a few weeks ago I found it faster when I had to run the model on the CPU only. The result is the same when switching back to the default value/preset, in case you have questions about that : )
And this is what happens when reloading the model after enabling Flash Attention:
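For completeness, here is a rough way to reproduce that Flash Attention reload outside LM Studio, again via llama-cpp-python. The flash_attn flag exists in recent versions of that binding; whether LM Studio 0.2.22 wires it up the same way is an assumption on my part:

```python
# Sketch of the Flash Attention reload, again NOT LM Studio's code.
# Assumes a recent llama-cpp-python that exposes the flash_attn flag.
from llama_cpp import Llama

MODEL_PATH = "Meta-Llama-3-8B.Q4_K_M.gguf"  # hypothetical path

# First load: full GPU offload, Flash Attention off.
llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, flash_attn=False)
del llm  # unload, mirroring LM Studio's model reload

# Second load: same settings but Flash Attention on. In my case this is
# where RAM usage explodes (8GB physical + 17GB swap) despite free VRAM.
llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, flash_attn=True)
```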
May LM Studio continue to flourish and prosper.
Best regards,
PD-Kerman
Version
0.2.22
Attachments
Log-1 Diagnostic Info
Log-2
main.log
Screenshots Combination