ollama / ollama

Get up and running with Llama 3.1, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License

Ollama not using 20GB of VRAM from Tesla P40 card #6456

Open Happydragun4now opened 4 weeks ago

Happydragun4now commented 4 weeks ago

What is the issue?

Not sure if this is a bug, damaged hardware, or a driver issue, but I thought I would report it just in case. Ollama sees 23.7 GiB available on each card when it detects them, but then only 3.7 GiB on one of them when it's trying to allocate memory. From the server logs:

time=2024-08-21T17:49:38.582-07:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-92d1b3ad-0ab8-2ece-050e-b4f5252f8098 library=cuda compute=6.1 driver=12.6 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"

time=2024-08-21T17:49:38.582-07:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-7a8dc17e-85e1-5bc8-e230-119d6be5252c library=cuda compute=6.1 driver=12.6 name="Tesla P40" total="23.9 GiB" available="23.7 GiB"

layers.requested=-1 layers.model=81 layers.offload=48 layers.split=3,45 memory.available="[3.7 GiB 23.7 GiB]"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.3.6

rick-github commented 4 weeks ago

What's the output of nvidia-smi? If you can attach a complete log, it may contain details that give a better understanding of what's going on.
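For example, something along these lines (a sketch using standard nvidia-smi options; running nvidia-smi with no arguments prints per-GPU memory plus the process table, the query form gives a compact summary, and setting OLLAMA_DEBUG=1 before starting the server produces a more verbose Ollama log):

nvidia-smi
nvidia-smi --query-gpu=index,name,memory.total,memory.used,memory.free --format=csv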

Happydragun4now commented 4 weeks ago

I'm on Windows, BTW. I ruled out a hardware issue because when I change the order of the devices in CUDA_VISIBLE_DEVICES, it changes which card loads the 20GB. I'm still not sure if it's a driver issue, but I have tried CUDA 11.7, 12.4, and 12.6, as well as a few different server drivers for the P40s, and I have tried Ollama 0.3.6 and 0.3.7.
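For reference, a minimal sketch of reordering the devices by UUID in PowerShell before starting the server (the UUIDs are the ones from the log above; if Ollama runs as the tray app or a service rather than from this shell, the variable has to be set in the system environment instead):

$env:CUDA_VISIBLE_DEVICES = "GPU-7a8dc17e-85e1-5bc8-e230-119d6be5252c,GPU-92d1b3ad-0ab8-2ece-050e-b4f5252f8098"
ollama serve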

Sorry, I can't get a screenshot right now, but nvidia-smi shows the same as this:

llm_load_tensors: CUDA0 buffer size = 920.12 MiB
llm_load_tensors: CUDA1 buffer size = 21536.62 MiB

One card will load 20GB and the other will load around 1GB.

Here are the full logs: server1.log

rick-github commented 4 weeks ago

The reason I asked for the output of nvidia-smi is that it shows which processes are using the GPU. The log shows that one of the GPUs has only 3.6 GiB free:

time=2024-08-21T16:26:53.736-07:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 
layers.model=81 layers.offload=46 layers.split=2,44 memory.available="[3.6 GiB 23.7 GiB]" 
memory.required.full="44.7  GiB" memory.required.partial="26.4 GiB" memory.required.kv="640.0 MiB"
memory.required.allocations="[3.1 GiB 23.3 GiB]" memory.weights.total="38.9 GiB" memory.weights.repeating="38.1 GiB"
memory.weights.nonrepeating="822.0 MiB"  memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"

If you can identify what is using 20G of one of your cards, you might be able to free up some VRAM for model loading.
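One way to see what is holding VRAM on each card is the per-process query below (a sketch using standard nvidia-smi options; note that on Windows the memory column can show N/A for processes running under the WDDM driver model, in which case the process table from plain nvidia-smi is the fallback):

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv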

Happydragun4now commented 4 weeks ago

As far as I could tell, nothing was using it; it showed something like 1600/24000 MiB. Maybe it's getting reserved by something, or Ollama/CUDA isn't reading it properly?

[image] I found this image from when I was attempting to find a fix last night; sorry, it doesn't show the processes.

Happydragun4now commented 3 weeks ago

This seemed to be due to the Quadro K2200; disabling it in Windows made the model load properly across the two P40s.

I have CUDA_VISIBLE_DEVICES set to the UUIDs of the P40s, so the Quadro shouldn't be detected, but maybe CUDA was confusing the available VRAM between the two cards?
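To double-check which UUID belongs to which card (and that the Quadro's UUID is not in the CUDA_VISIBLE_DEVICES list), the devices can be listed with standard nvidia-smi options, for example:

nvidia-smi -L
nvidia-smi --query-gpu=index,uuid,name --format=csv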

I'm not sure if you want to keep the ticket open for investigation or would like to close it, but thanks for taking the time to look at this.