ollama / ollama

Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License

Scheduler should respect main_gpu on multi-gpu setup #6493

Open · henryclw opened 1 month ago

henryclw commented 1 month ago

What is the issue?

The main_gpu option is not working as expected.

My system has two GPUs. I sent the following request to /api/chat:

{
    "model": "llama3.1:8b-instruct-q8_0",
    "messages": [
        {
            "role": "user",
            "content": "What is the color of our sky?"
        }
    ],
    "stream": false,
    "keep_alive": -1,
    "options": {
        "use_mmap": false,
        "main_gpu": 1
    }
}
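
For reference, the same request as a curl one-liner (assuming the server is reachable at the default localhost:11434):

curl http://localhost:11434/api/chat -d '{
    "model": "llama3.1:8b-instruct-q8_0",
    "messages": [{"role": "user", "content": "What is the color of our sky?"}],
    "stream": false,
    "keep_alive": -1,
    "options": {"use_mmap": false, "main_gpu": 1}
}'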

Expected behavior: the model is loaded on my second GPU (i.e. GPU 1).
Actual behavior: the model is always loaded on my first GPU (i.e. GPU 0), regardless of whether main_gpu is set to 0 or 1.
P.S. The model fits entirely on either GPU; a single GPU is enough to hold all the weights.

I do know that I could set CUDA_VISIBLE_DEVICES to specify which GPU to use, as #1813 suggests. But an environment variable is not as flexible as a request parameter, which can be adjusted per request.
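
For completeness, a minimal sketch of that environment-variable workaround for a Docker deployment (the container name and volume mount are illustrative; CUDA_VISIBLE_DEVICES=1 exposes only the second GPU to the server process):

# pin the whole server to GPU 1 at container start
docker run -d --gpus=all \
    -e CUDA_VISIBLE_DEVICES=1 \
    -v ollama:/root/.ollama \
    -p 11434:11434 \
    --name ollama ollama/ollama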

Maybe this parameter is not passed through to llama.cpp correctly, or llama.cpp is not selecting the GPU as expected?

OS

Docker

GPU

Nvidia

CPU

AMD

Ollama version

0.3.6

rick-github commented 1 month ago

Server logs may shed light on why main_gpu is not acting as expected.
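
For a Docker install they can be captured with, for example (container name assumed to be ollama):

docker logs -f ollama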

henryclw commented 1 month ago
2024/08/24 22:31:35 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-08-24T22:31:35.687Z level=INFO source=images.go:782 msg="total blobs: 62"
time=2024-08-24T22:31:35.946Z level=INFO source=images.go:790 msg="total unused blobs removed: 0"
time=2024-08-24T22:31:36.076Z level=INFO source=routes.go:1172 msg="Listening on [::]:11434 (version 0.3.6)"
time=2024-08-24T22:31:36.077Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama3948064601/runners
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=rocm_v60102 file=build/linux/x86_64/rocm_v60102/bin/deps.txt.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=rocm_v60102 file=build/linux/x86_64/rocm_v60102/bin/ollama_llama_server.gz
time=2024-08-24T22:31:39.255Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu/ollama_llama_server
time=2024-08-24T22:31:39.255Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu_avx/ollama_llama_server
time=2024-08-24T22:31:39.255Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu_avx2/ollama_llama_server
time=2024-08-24T22:31:39.255Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cuda_v11/ollama_llama_server
time=2024-08-24T22:31:39.255Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/rocm_v60102/ollama_llama_server
time=2024-08-24T22:31:39.255Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11 rocm_v60102 cpu cpu_avx]"
time=2024-08-24T22:31:39.255Z level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-08-24T22:31:39.255Z level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-08-24T22:31:39.255Z level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
time=2024-08-24T22:31:39.255Z level=DEBUG source=gpu.go:90 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-08-24T22:31:39.255Z level=DEBUG source=gpu.go:472 msg="Searching for GPU library" name=libcuda.so*
time=2024-08-24T22:31:39.255Z level=DEBUG source=gpu.go:491 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcuda.so** /usr/local/nvidia/lib64/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-08-24T22:31:39.256Z level=DEBUG source=gpu.go:525 msg="discovered GPU libraries" paths="[/usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/wsl/drivers/nv_dispi.inf_amd64_78cd02ab022cd554/libcuda.so.1.1]"
CUDA driver version: 12.6
time=2024-08-24T22:31:39.296Z level=DEBUG source=gpu.go:123 msg="detected GPUs" count=2 library=/usr/lib/x86_64-linux-gnu/libcuda.so.1
[GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898] CUDA totalMem 12287 mb
[GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898] CUDA freeMem 11247 mb
[GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898] Compute Capability 8.6
[GPU-b18a7f7a-1c7f-15b3-064d-6596d1a63015] CUDA totalMem 12287 mb
[GPU-b18a7f7a-1c7f-15b3-064d-6596d1a63015] CUDA freeMem 11247 mb
[GPU-b18a7f7a-1c7f-15b3-064d-6596d1a63015] Compute Capability 8.6
time=2024-08-24T22:31:39.601Z level=DEBUG source=amd_linux.go:371 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2024-08-24T22:31:39.601Z level=INFO source=types.go:105 msg="inference compute" id=GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898 library=cuda compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"
time=2024-08-24T22:31:39.601Z level=INFO source=types.go:105 msg="inference compute" id=GPU-b18a7f7a-1c7f-15b3-064d-6596d1a63015 library=cuda compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"
time=2024-08-24T22:31:53.914Z level=DEBUG source=gpu.go:362 msg="updating system memory data" before.total="31.3 GiB" before.free="26.8 GiB" before.free_swap="16.0 GiB" now.total="31.3 GiB" now.free="26.1 GiB" now.free_swap="16.0 GiB"
CUDA driver version: 12.6
time=2024-08-24T22:31:54.037Z level=DEBUG source=gpu.go:410 msg="updating cuda memory data" gpu=GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898 name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="1.0 GiB"
time=2024-08-24T22:31:54.167Z level=DEBUG source=gpu.go:410 msg="updating cuda memory data" gpu=GPU-b18a7f7a-1c7f-15b3-064d-6596d1a63015 name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="1.0 GiB"
releasing cuda driver library
time=2024-08-24T22:31:54.167Z level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x8161c0 gpu_count=2
time=2024-08-24T22:31:54.211Z level=DEBUG source=sched.go:219 msg="loading first model" model=/root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d
time=2024-08-24T22:31:54.211Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2024-08-24T22:31:54.211Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d gpu=GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898 parallel=4 available=11793334272 required="9.7 GiB"
time=2024-08-24T22:31:54.211Z level=DEBUG source=server.go:101 msg="system memory" total="31.3 GiB" free="26.1 GiB" free_swap="16.0 GiB"
time=2024-08-24T22:31:54.211Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2024-08-24T22:31:54.211Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[11.0 GiB]" memory.required.full="9.7 GiB" memory.required.partial="9.7 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[9.7 GiB]" memory.weights.total="7.9 GiB" memory.weights.repeating="7.4 GiB" memory.weights.nonrepeating="532.3 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu_avx/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu_avx2/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cuda_v11/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/rocm_v60102/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu_avx/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu_avx2/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cuda_v11/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/rocm_v60102/ollama_llama_server
time=2024-08-24T22:31:54.213Z level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama3948064601/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --main-gpu 1 --no-mmap --parallel 4 --port 39689"
time=2024-08-24T22:31:54.213Z level=DEBUG source=server.go:410 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/tmp/ollama3948064601/runners/cuda_v11:/tmp/ollama3948064601/runners:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898]"
time=2024-08-24T22:31:54.213Z level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-24T22:31:54.213Z level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-24T22:31:54.214Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="1e6f655" tid="140093911531520" timestamp=1724538714
INFO [main] system info | n_threads=12 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140093911531520" timestamp=1724538714 total_threads=24
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="23" port="39689" tid="140093911531520" timestamp=1724538714
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 7
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2024-08-24T22:31:54.465Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   532.31 MiB
llm_load_tensors:      CUDA0 buffer size =  7605.34 MiB
time=2024-08-24T22:31:56.723Z level=DEBUG source=server.go:638 msg="model load progress 0.07"
time=2024-08-24T22:31:57.977Z level=DEBUG source=server.go:638 msg="model load progress 0.13"
time=2024-08-24T22:31:58.228Z level=DEBUG source=server.go:638 msg="model load progress 0.14"
time=2024-08-24T22:31:58.479Z level=DEBUG source=server.go:638 msg="model load progress 0.15"
time=2024-08-24T22:31:58.730Z level=DEBUG source=server.go:638 msg="model load progress 0.16"
time=2024-08-24T22:31:58.981Z level=DEBUG source=server.go:638 msg="model load progress 0.18"
time=2024-08-24T22:31:59.231Z level=DEBUG source=server.go:638 msg="model load progress 0.19"
time=2024-08-24T22:31:59.482Z level=DEBUG source=server.go:638 msg="model load progress 0.21"
time=2024-08-24T22:31:59.733Z level=DEBUG source=server.go:638 msg="model load progress 0.22"
time=2024-08-24T22:31:59.984Z level=DEBUG source=server.go:638 msg="model load progress 0.23"
time=2024-08-24T22:32:00.235Z level=DEBUG source=server.go:638 msg="model load progress 0.24"
time=2024-08-24T22:32:00.485Z level=DEBUG source=server.go:638 msg="model load progress 0.25"
time=2024-08-24T22:32:00.736Z level=DEBUG source=server.go:638 msg="model load progress 0.27"
time=2024-08-24T22:32:00.987Z level=DEBUG source=server.go:638 msg="model load progress 0.28"
time=2024-08-24T22:32:01.238Z level=DEBUG source=server.go:638 msg="model load progress 0.29"
time=2024-08-24T22:32:01.489Z level=DEBUG source=server.go:638 msg="model load progress 0.30"
time=2024-08-24T22:32:01.740Z level=DEBUG source=server.go:638 msg="model load progress 0.31"
time=2024-08-24T22:32:01.991Z level=DEBUG source=server.go:638 msg="model load progress 0.33"
time=2024-08-24T22:32:02.492Z level=DEBUG source=server.go:638 msg="model load progress 0.35"
time=2024-08-24T22:32:02.743Z level=DEBUG source=server.go:638 msg="model load progress 0.36"
time=2024-08-24T22:32:02.994Z level=DEBUG source=server.go:638 msg="model load progress 0.37"
time=2024-08-24T22:32:03.244Z level=DEBUG source=server.go:638 msg="model load progress 0.38"
time=2024-08-24T22:32:03.495Z level=DEBUG source=server.go:638 msg="model load progress 0.39"
time=2024-08-24T22:32:03.746Z level=DEBUG source=server.go:638 msg="model load progress 0.40"
time=2024-08-24T22:32:03.997Z level=DEBUG source=server.go:638 msg="model load progress 0.41"
time=2024-08-24T22:32:04.248Z level=DEBUG source=server.go:638 msg="model load progress 0.42"
time=2024-08-24T22:32:04.498Z level=DEBUG source=server.go:638 msg="model load progress 0.43"
time=2024-08-24T22:32:04.749Z level=DEBUG source=server.go:638 msg="model load progress 0.44"
time=2024-08-24T22:32:05.000Z level=DEBUG source=server.go:638 msg="model load progress 0.45"
time=2024-08-24T22:32:05.251Z level=DEBUG source=server.go:638 msg="model load progress 0.46"
time=2024-08-24T22:32:05.501Z level=DEBUG source=server.go:638 msg="model load progress 0.47"
[GIN] 2024/08/24 - 22:32:05 | 200 |  214.837914ms |      172.19.0.1 | GET      "/api/tags"
time=2024-08-24T22:32:05.753Z level=DEBUG source=server.go:638 msg="model load progress 0.49"
time=2024-08-24T22:32:06.003Z level=DEBUG source=server.go:638 msg="model load progress 0.50"
time=2024-08-24T22:32:06.505Z level=DEBUG source=server.go:638 msg="model load progress 0.52"
time=2024-08-24T22:32:07.007Z level=DEBUG source=server.go:638 msg="model load progress 0.54"
time=2024-08-24T22:32:07.508Z level=DEBUG source=server.go:638 msg="model load progress 0.56"
time=2024-08-24T22:32:07.759Z level=DEBUG source=server.go:638 msg="model load progress 0.57"
time=2024-08-24T22:32:08.010Z level=DEBUG source=server.go:638 msg="model load progress 0.58"
time=2024-08-24T22:32:08.261Z level=DEBUG source=server.go:638 msg="model load progress 0.59"
time=2024-08-24T22:32:08.511Z level=DEBUG source=server.go:638 msg="model load progress 0.60"
time=2024-08-24T22:32:08.762Z level=DEBUG source=server.go:638 msg="model load progress 0.61"
time=2024-08-24T22:32:09.013Z level=DEBUG source=server.go:638 msg="model load progress 0.62"
time=2024-08-24T22:32:09.515Z level=DEBUG source=server.go:638 msg="model load progress 0.64"
time=2024-08-24T22:32:09.765Z level=DEBUG source=server.go:638 msg="model load progress 0.65"
time=2024-08-24T22:32:10.016Z level=DEBUG source=server.go:638 msg="model load progress 0.66"
time=2024-08-24T22:32:10.267Z level=DEBUG source=server.go:638 msg="model load progress 0.67"
time=2024-08-24T22:32:10.518Z level=DEBUG source=server.go:638 msg="model load progress 0.68"
time=2024-08-24T22:32:10.769Z level=DEBUG source=server.go:638 msg="model load progress 0.69"
time=2024-08-24T22:32:11.019Z level=DEBUG source=server.go:638 msg="model load progress 0.71"
time=2024-08-24T22:32:11.521Z level=DEBUG source=server.go:638 msg="model load progress 0.73"
time=2024-08-24T22:32:11.771Z level=DEBUG source=server.go:638 msg="model load progress 0.74"
time=2024-08-24T22:32:12.022Z level=DEBUG source=server.go:638 msg="model load progress 0.75"
time=2024-08-24T22:32:12.273Z level=DEBUG source=server.go:638 msg="model load progress 0.76"
time=2024-08-24T22:32:12.524Z level=DEBUG source=server.go:638 msg="model load progress 0.78"
time=2024-08-24T22:32:12.775Z level=DEBUG source=server.go:638 msg="model load progress 0.79"
time=2024-08-24T22:32:13.025Z level=DEBUG source=server.go:638 msg="model load progress 0.80"
time=2024-08-24T22:32:13.276Z level=DEBUG source=server.go:638 msg="model load progress 0.81"
time=2024-08-24T22:32:13.527Z level=DEBUG source=server.go:638 msg="model load progress 0.82"
time=2024-08-24T22:32:13.778Z level=DEBUG source=server.go:638 msg="model load progress 0.83"
time=2024-08-24T22:32:14.028Z level=DEBUG source=server.go:638 msg="model load progress 0.84"
time=2024-08-24T22:32:14.279Z level=DEBUG source=server.go:638 msg="model load progress 0.86"
time=2024-08-24T22:32:14.529Z level=DEBUG source=server.go:638 msg="model load progress 0.87"
time=2024-08-24T22:32:14.779Z level=DEBUG source=server.go:638 msg="model load progress 0.88"
time=2024-08-24T22:32:15.030Z level=DEBUG source=server.go:638 msg="model load progress 0.89"
time=2024-08-24T22:32:15.281Z level=DEBUG source=server.go:638 msg="model load progress 0.90"
time=2024-08-24T22:32:15.531Z level=DEBUG source=server.go:638 msg="model load progress 0.91"
time=2024-08-24T22:32:15.782Z level=DEBUG source=server.go:638 msg="model load progress 0.92"
time=2024-08-24T22:32:16.033Z level=DEBUG source=server.go:638 msg="model load progress 0.94"
time=2024-08-24T22:32:16.284Z level=DEBUG source=server.go:638 msg="model load progress 0.95"
time=2024-08-24T22:32:16.534Z level=DEBUG source=server.go:638 msg="model load progress 0.96"
time=2024-08-24T22:32:16.785Z level=DEBUG source=server.go:638 msg="model load progress 0.98"
time=2024-08-24T22:32:17.036Z level=DEBUG source=server.go:638 msg="model load progress 0.99"
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
time=2024-08-24T22:32:17.287Z level=DEBUG source=server.go:638 msg="model load progress 1.00"
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
time=2024-08-24T22:32:17.538Z level=DEBUG source=server.go:641 msg="model load completed, waiting for server to become available" status="llm server loading model"
DEBUG [initialize] initializing slots | n_slots=4 tid="140093911531520" timestamp=1724538738
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="140093911531520" timestamp=1724538738
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=1 tid="140093911531520" timestamp=1724538738
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=2 tid="140093911531520" timestamp=1724538738
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=3 tid="140093911531520" timestamp=1724538738
INFO [main] model loaded | tid="140093911531520" timestamp=1724538738
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="140093911531520" timestamp=1724538738
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=0 tid="140093911531520" timestamp=1724538738
time=2024-08-24T22:32:18.290Z level=INFO source=server.go:632 msg="llama runner started in 24.08 seconds"
time=2024-08-24T22:32:18.290Z level=DEBUG source=sched.go:458 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d
time=2024-08-24T22:32:18.291Z level=DEBUG source=routes.go:1363 msg="chat request" images=0 prompt="<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the color of our sky?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1 tid="140093911531520" timestamp=1724538738
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=2 tid="140093911531520" timestamp=1724538738
DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=19 slot_id=0 task_id=2 tid="140093911531520" timestamp=1724538738
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=2 tid="140093911531520" timestamp=1724538738
DEBUG [print_timings] prompt eval time     =      44.27 ms /    19 tokens (    2.33 ms per token,   429.15 tokens per second) | n_prompt_tokens_processed=19 n_tokens_second=429.1457740434567 slot_id=0 t_prompt_processing=44.274 t_token=2.3302105263157897 task_id=2 tid="140093911531520" timestamp=1724538744
DEBUG [print_timings] generation eval time =    5633.89 ms /   171 runs   (   32.95 ms per token,    30.35 tokens per second) | n_decoded=171 n_tokens_second=30.352035689734034 slot_id=0 t_token=32.94671929824562 t_token_generation=5633.889 task_id=2 tid="140093911531520" timestamp=1724538744
DEBUG [print_timings]           total time =    5678.16 ms | slot_id=0 t_prompt_processing=44.274 t_token_generation=5633.889 t_total=5678.1630000000005 task_id=2 tid="140093911531520" timestamp=1724538744
DEBUG [update_slots] slot released | n_cache_tokens=190 n_ctx=8192 n_past=189 n_system_tokens=0 slot_id=0 task_id=2 tid="140093911531520" timestamp=1724538744 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=35392 status=200 tid="140093251584000" timestamp=1724538744
[GIN] 2024/08/24 - 22:32:24 | 200 | 30.171645765s |      172.19.0.1 | POST     "/api/chat"
time=2024-08-24T22:32:24.013Z level=DEBUG source=sched.go:462 msg="context for request finished"
time=2024-08-24T22:32:24.013Z level=DEBUG source=sched.go:334 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d duration=2562047h47m16.854775807s
time=2024-08-24T22:32:24.013Z level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d refCount=0

The two GPUs have been detected:

time=2024-08-24T22:31:39.601Z level=INFO source=types.go:105 msg="inference compute" id=GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898 library=cuda compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"
time=2024-08-24T22:31:39.601Z level=INFO source=types.go:105 msg="inference compute" id=GPU-b18a7f7a-1c7f-15b3-064d-6596d1a63015 library=cuda compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"

I passed the request with "main_gpu": 1, which should select GPU-b18a7f7a-1c7f-15b3-064d-6596d1a63015. But then the scheduler reports:

time=2024-08-24T22:31:54.211Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2024-08-24T22:31:54.211Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d gpu=GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898 parallel=4 available=11793334272 required="9.7 GiB"
time=2024-08-24T22:31:54.211Z level=DEBUG source=server.go:101 msg="system memory" total="31.3 GiB" free="26.1 GiB" free_swap="16.0 GiB"
time=2024-08-24T22:31:54.211Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2024-08-24T22:31:54.211Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[11.0 GiB]" memory.required.full="9.7 GiB" memory.required.partial="9.7 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[9.7 GiB]" memory.weights.total="7.9 GiB" memory.weights.repeating="7.4 GiB" memory.weights.nonrepeating="532.3 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"

This means it is loading the weights onto GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898 (GPU 0) instead of GPU 1.
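
A quick way to double-check where the weights actually landed is per-GPU memory usage, e.g.:

# the ~9.7 GiB of model weights should show up under exactly one device
nvidia-smi --query-gpu=index,uuid,memory.used --format=csv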

dhiltgen commented 1 month ago

The scheduler doesn't currently interpret the main_gpu setting for GPU selection. You can filter GPUs via the GPU vendor-specific environment variables.
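
For example, using the variable names that appear in the server config logged above (the device values here are illustrative):

# NVIDIA
CUDA_VISIBLE_DEVICES=1 ollama serve
# AMD
ROCR_VISIBLE_DEVICES=1 ollama serve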