janhq / jan

Jan is an open source alternative to ChatGPT that runs 100% offline on your computer, with multiple engine support (llama.cpp, TensorRT-LLM).
https://jan.ai/

bug: GPU accelerated model fails to load without a visible error message #3761

Status: Open

sgdesmet commented 3 days ago

Jan version

0.5.5

Describe the Bug

I'm running Fedora 40 on a laptop with a GTX 1050 Ti with 4 GB of VRAM. When I enable GPU acceleration and attempt to run models that are marked as 'Slow on your device' (such as Llama 3.2 3B Instruct Q8), they fail to start without any visible error message. At first glance, the logs show what appears to be a memory issue:

2024-10-03T09:04:05.888Z [CORTEX]::Error: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1026.00 MiB on device 0: cudaMalloc failed: out of memory

Is it correct that my device is unable to run this particular model? If so, I would expect a 'Not enough VRAM' indicator when downloading the model and an explicit error message when it fails to start.
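For context, a Q8 quantization stores roughly one byte per weight, so a 3B-parameter model is about 3.2 GB of weights alone, before the KV cache and CUDA context overhead; that plausibly exceeds what is free on a 4 GB card, which matches the cudaMalloc failure in the log. Below is a minimal sketch of the kind of pre-flight check that could surface a 'Not enough VRAM' error up front. This is illustrative only, not Jan/Cortex code, and the 3.2 GB estimate is an assumption:

```cpp
// Illustrative sketch (not Jan/Cortex code): query free VRAM with the
// CUDA runtime API before attempting to load a model, so an out-of-memory
// condition can be reported up front instead of failing silently.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Rough assumed estimate for Llama 3.2 3B at Q8: ~1 byte per weight
    // -> ~3.2 GB, before KV cache and CUDA context overhead.
    const size_t estimated_need = (size_t)3.2e9;

    printf("VRAM: %.2f GiB free of %.2f GiB total\n",
           free_bytes / 1073741824.0, total_bytes / 1073741824.0);
    if (free_bytes < estimated_need) {
        printf("Not enough VRAM for this model (need ~%.1f GiB)\n",
               estimated_need / 1073741824.0);
    }
    return 0;
}
```

A check along these lines before model load would let the UI show an explicit error instead of a 'Starting model' indicator that never resolves.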

Steps to Reproduce

  1. Install additional Cortex.cpp dependencies
  2. Enable GPU acceleration
  3. Download model Llama 3.2 3B Instruct Q8
  4. Start a new thread and enter some text
  5. Observe 'Starting model' loading indicator
  6. Nothing happens

Screenshots / Logs

app.log

What is your OS?

Linux (Fedora 40)

louis-jan commented 3 days ago

Related issue #3760

Asherathe commented 19 hours ago

Jan works well with small models, but larger models are unpredictable. With 16 GB of VRAM and 32 GB of RAM, sometimes a 34B model loads and sometimes it doesn't. Sometimes my whole system freezes and I have to restart; sometimes my GPU driver silently resets; sometimes everything works normally. KoboldCpp, Msty, LM Studio, etc. load the same model fine every time.
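One hedged guess at why borderline models are hit-or-miss: when a load sits close to the VRAM limit, success can depend on whatever else the GPU happens to be holding at that moment. llama.cpp-based runtimes expose n_gpu_layers to offload only part of the model to the GPU and keep the rest in system RAM, which is one way the frontends mentioned above may stay stable. The sketch below is illustrative, not Jan's internals; the model path and layer count are placeholder assumptions, and the calls shown are llama.cpp's C API as of around the time of this issue:

```cpp
// Illustrative sketch against llama.cpp's C API (not Jan/Cortex code):
// offload only part of the model's layers to the GPU so a model that is
// too big for VRAM can still load, with the remainder kept in system RAM.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params params = llama_model_default_params();
    // Offload only 20 layers to the GPU (placeholder value); the number
    // that actually fits depends on the model, quantization, and free VRAM.
    params.n_gpu_layers = 20;

    // "model.gguf" is a placeholder path.
    llama_model* model = llama_load_model_from_file("model.gguf", params);
    if (model == nullptr) {
        fprintf(stderr, "failed to load model (still out of memory?)\n");
        llama_backend_free();
        return 1;
    }

    printf("model loaded with partial GPU offload\n");
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```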