lmstudio-ai / lmstudio-bug-tracker

Bug tracking for the LM Studio desktop application

Unexpected RAM/VRAM Consumption #10

Open PD-Kerman opened 1 month ago

PD-Kerman commented 1 month ago

Version

0.2.22

What went wrong

Unexpected RAM consumption during and after model load when using full GPU offload.

Unexpected Behavior In Detail

I just upgraded LM Studio to 0.2.22 and got it running with a Tesla M40, which has 12 GB of VRAM. The previously downloaded LLAMA3-8B-Q4_K_M (~4 GB) should load entirely into VRAM without any issue. And indeed, with GPU offload set to max, VRAM usage is normal (~5 GB) and token generation is much faster, which suggests the model is running properly on the GPU. So I am curious why the app is still consuming a huge amount of RAM (over 4 GB of physical RAM and 10-15 GB of swap/virtual memory). It gets worse when Flash Attention is enabled: the app simply runs out of RAM (8 GB physical and 17 GB swap) and crashes, even though there is still plenty of free VRAM on the GPU side. Logs and screenshots are attached.
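To quantify the discrepancy while reproducing this, here is a minimal monitoring sketch (assuming Python with `psutil` installed and `nvidia-smi` on the PATH; the one-second polling interval is arbitrary) that logs system RAM, swap, and GPU VRAM side by side while the model loads:

```python
import subprocess
import time

import psutil  # pip install psutil


def gpu_vram_used_mib() -> int:
    # Query total VRAM in use via nvidia-smi (first GPU only,
    # which is the Tesla M40 in this setup).
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.strip().splitlines()[0])


def snapshot() -> str:
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    return (f"RAM used: {vm.used / 2**30:.2f} GiB | "
            f"swap used: {sw.used / 2**30:.2f} GiB | "
            f"VRAM used: {gpu_vram_used_mib() / 1024:.2f} GiB")


if __name__ == "__main__":
    # Poll once per second; load the model in LM Studio while this
    # runs and compare RAM/swap growth against VRAM growth.
    while True:
        print(time.strftime("%H:%M:%S"), snapshot())
        time.sleep(1.0)
```

Running this during a model load shows the problem directly: for a fully offloaded model, only the VRAM line should grow by gigabytes, yet RAM and swap climb instead.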

Expected Behavior

Normal RAM and VRAM usage (RAM: 2-3 GB, the same as when idle; VRAM: depends on model size).

Attachments

Log-1 Diagnostic Info

{
  "cause": "(Exit code: 0). Some model operation failed. Try a different model and/or config.",
  "suggestion": "",
  "data": {
    "memory": {
      "ram_capacity": "7.88 GB",
      "ram_unused": "832.32 MB"
    },
    "gpu": {
      "type": "NvidiaCuda",
      "vram_recommended_capacity": "12.00 GB",
      "vram_unused": "11.10 GB"
    },
    "os": {
      "platform": "win32",
      "version": "10.0.22631",
      "supports_avx2": true
    },
    "app": {
      "version": "0.2.22",
      "downloadsDir": "E:\\LM_STUDIO_M_ARC"
    },
    "model": {}
  },
  "title": "Error loading model."
}
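The diagnostic above already shows the imbalance: only 832 MB of the 7.88 GB of system RAM is free at the moment of failure, while 11.10 GB of the 12 GB of VRAM sits unused. A small sketch that extracts these numbers from such a report (the file name `diagnostic.json` is hypothetical; paste the JSON block above into it):

```python
import json

# Hypothetical file holding the diagnostic JSON shown above.
with open("diagnostic.json") as f:
    report = json.load(f)


def to_gb(s: str) -> float:
    # Values appear as e.g. "7.88 GB" or "832.32 MB".
    value, unit = s.split()
    return float(value) / 1024 if unit == "MB" else float(value)


mem, gpu = report["data"]["memory"], report["data"]["gpu"]
ram_total = to_gb(mem["ram_capacity"])
ram_free = to_gb(mem["ram_unused"])
vram_free = to_gb(gpu["vram_unused"])

print(f"RAM in use: {ram_total - ram_free:.2f} GB of {ram_total:.2f} GB")
print(f"VRAM free:  {vram_free:.2f} GB")

# The failure mode reported in this issue: host RAM nearly
# exhausted while plenty of VRAM remains.
if ram_free < 1.0 and vram_free > 4.0:
    print("Host RAM is nearly exhausted despite free VRAM.")
```

Read this way, the "Error loading model" looks like host-RAM exhaustion rather than a VRAM shortage, which matches the behavior described above.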

Log-2

main.log

Screenshots Combination

(Screenshot attachments.)

Screenshot Related INFO/ChatLog

- Well, as you can see, I attached four screenshots: two were taken before loading the model, and the other two after offloading the entire model to the GPU.
- P.S. FYI, the preset was the same as the built-in Llama 3 preset, except that I changed the CPU threads value from 4 to 128, because a few weeks back I found it faster when I had to run on CPU only. The result is the same when switching back to the default value/preset, in case that raises any questions. :)
- The final screenshots show what happens when reloading the model after enabling Flash Attention.

May LM Studio continue to flourish and prosper.

Best regards,
PD-Kerman