ollama / ollama

Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License

Scheduler should respect main_gpu on multi-gpu setup #6493

Open · henryclw opened 1 month ago

henryclw commented 1 month ago

What is the issue?

The main_gpu option is not working as expected.

My system has two GPUs. I sent the following request to /api/chat:

{
    "model": "llama3.1:8b-instruct-q8_0",
    "messages": [
        {
            "role": "user",
            "content": "What is the color of our sky?"
        }
    ],
    "stream": false,
    "keep_alive": -1,
    "options": {
        "use_mmap": false,
        "main_gpu": 1
    }
}
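
For reference, the same request as a curl one-liner (assuming the server is reachable at the default localhost:11434):

curl http://localhost:11434/api/chat -d '{
    "model": "llama3.1:8b-instruct-q8_0",
    "messages": [{"role": "user", "content": "What is the color of our sky?"}],
    "stream": false,
    "keep_alive": -1,
    "options": {"use_mmap": false, "main_gpu": 1}
}'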

Expected behavior: the model is loaded on my second GPU (i.e. GPU 1).
Actual behavior: the model is always loaded on my first GPU (i.e. GPU 0), regardless of whether main_gpu is set to 0 or 1.
P.S. The model fits entirely on either GPU; a single GPU is enough to hold all the weights.

I do know that I could set CUDA_VISIBLE_DEVICES to specify which GPU to use, as #1813 suggests. But an environment variable is not as flexible as a request parameter, which can be adjusted per request.
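
For completeness, a minimal sketch of that environment-variable workaround for a Docker deployment (the container name and volume mount are illustrative; CUDA_VISIBLE_DEVICES=1 exposes only the second GPU to the server process):

# pin the whole server to GPU 1 at container start
docker run -d --gpus=all \
    -e CUDA_VISIBLE_DEVICES=1 \
    -v ollama:/root/.ollama \
    -p 11434:11434 \
    --name ollama ollama/ollama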

Maybe this parameter is not passed through to llama.cpp correctly, or llama.cpp is not selecting the GPU as expected?

OS

Docker

GPU

Nvidia

CPU

AMD

Ollama version

0.3.6

rick-github commented 1 month ago

Server logs may shed light on why main_gpu is not acting as expected.
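
For a Docker install they can be captured with, for example (container name assumed to be ollama):

docker logs -f ollama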

henryclw commented 1 month ago
2024/08/24 22:31:35 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-08-24T22:31:35.687Z level=INFO source=images.go:782 msg="total blobs: 62"
time=2024-08-24T22:31:35.946Z level=INFO source=images.go:790 msg="total unused blobs removed: 0"
time=2024-08-24T22:31:36.076Z level=INFO source=routes.go:1172 msg="Listening on [::]:11434 (version 0.3.6)"
time=2024-08-24T22:31:36.077Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama3948064601/runners
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=rocm_v60102 file=build/linux/x86_64/rocm_v60102/bin/deps.txt.gz
time=2024-08-24T22:31:36.077Z level=DEBUG source=payload.go:182 msg=extracting variant=rocm_v60102 file=build/linux/x86_64/rocm_v60102/bin/ollama_llama_server.gz
time=2024-08-24T22:31:39.255Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu/ollama_llama_server
time=2024-08-24T22:31:39.255Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu_avx/ollama_llama_server
time=2024-08-24T22:31:39.255Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu_avx2/ollama_llama_server
time=2024-08-24T22:31:39.255Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cuda_v11/ollama_llama_server
time=2024-08-24T22:31:39.255Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/rocm_v60102/ollama_llama_server
time=2024-08-24T22:31:39.255Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11 rocm_v60102 cpu cpu_avx]"
time=2024-08-24T22:31:39.255Z level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-08-24T22:31:39.255Z level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-08-24T22:31:39.255Z level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
time=2024-08-24T22:31:39.255Z level=DEBUG source=gpu.go:90 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-08-24T22:31:39.255Z level=DEBUG source=gpu.go:472 msg="Searching for GPU library" name=libcuda.so*
time=2024-08-24T22:31:39.255Z level=DEBUG source=gpu.go:491 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcuda.so** /usr/local/nvidia/lib64/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-08-24T22:31:39.256Z level=DEBUG source=gpu.go:525 msg="discovered GPU libraries" paths="[/usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/wsl/drivers/nv_dispi.inf_amd64_78cd02ab022cd554/libcuda.so.1.1]"
CUDA driver version: 12.6
time=2024-08-24T22:31:39.296Z level=DEBUG source=gpu.go:123 msg="detected GPUs" count=2 library=/usr/lib/x86_64-linux-gnu/libcuda.so.1
[GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898] CUDA totalMem 12287 mb
[GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898] CUDA freeMem 11247 mb
[GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898] Compute Capability 8.6
[GPU-b18a7f7a-1c7f-15b3-064d-6596d1a63015] CUDA totalMem 12287 mb
[GPU-b18a7f7a-1c7f-15b3-064d-6596d1a63015] CUDA freeMem 11247 mb
[GPU-b18a7f7a-1c7f-15b3-064d-6596d1a63015] Compute Capability 8.6
time=2024-08-24T22:31:39.601Z level=DEBUG source=amd_linux.go:371 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2024-08-24T22:31:39.601Z level=INFO source=types.go:105 msg="inference compute" id=GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898 library=cuda compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"
time=2024-08-24T22:31:39.601Z level=INFO source=types.go:105 msg="inference compute" id=GPU-b18a7f7a-1c7f-15b3-064d-6596d1a63015 library=cuda compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"
time=2024-08-24T22:31:53.914Z level=DEBUG source=gpu.go:362 msg="updating system memory data" before.total="31.3 GiB" before.free="26.8 GiB" before.free_swap="16.0 GiB" now.total="31.3 GiB" now.free="26.1 GiB" now.free_swap="16.0 GiB"
CUDA driver version: 12.6
time=2024-08-24T22:31:54.037Z level=DEBUG source=gpu.go:410 msg="updating cuda memory data" gpu=GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898 name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="1.0 GiB"
time=2024-08-24T22:31:54.167Z level=DEBUG source=gpu.go:410 msg="updating cuda memory data" gpu=GPU-b18a7f7a-1c7f-15b3-064d-6596d1a63015 name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="1.0 GiB"
releasing cuda driver library
time=2024-08-24T22:31:54.167Z level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x8161c0 gpu_count=2
time=2024-08-24T22:31:54.211Z level=DEBUG source=sched.go:219 msg="loading first model" model=/root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d
time=2024-08-24T22:31:54.211Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2024-08-24T22:31:54.211Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d gpu=GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898 parallel=4 available=11793334272 required="9.7 GiB"
time=2024-08-24T22:31:54.211Z level=DEBUG source=server.go:101 msg="system memory" total="31.3 GiB" free="26.1 GiB" free_swap="16.0 GiB"
time=2024-08-24T22:31:54.211Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2024-08-24T22:31:54.211Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[11.0 GiB]" memory.required.full="9.7 GiB" memory.required.partial="9.7 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[9.7 GiB]" memory.weights.total="7.9 GiB" memory.weights.repeating="7.4 GiB" memory.weights.nonrepeating="532.3 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu_avx/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu_avx2/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cuda_v11/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/rocm_v60102/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu_avx/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cpu_avx2/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/cuda_v11/ollama_llama_server
time=2024-08-24T22:31:54.212Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama3948064601/runners/rocm_v60102/ollama_llama_server
time=2024-08-24T22:31:54.213Z level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama3948064601/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --main-gpu 1 --no-mmap --parallel 4 --port 39689"
time=2024-08-24T22:31:54.213Z level=DEBUG source=server.go:410 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/tmp/ollama3948064601/runners/cuda_v11:/tmp/ollama3948064601/runners:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898]"
time=2024-08-24T22:31:54.213Z level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-24T22:31:54.213Z level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-24T22:31:54.214Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="1e6f655" tid="140093911531520" timestamp=1724538714
INFO [main] system info | n_threads=12 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140093911531520" timestamp=1724538714 total_threads=24
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="23" port="39689" tid="140093911531520" timestamp=1724538714
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 7
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2024-08-24T22:31:54.465Z level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   532.31 MiB
llm_load_tensors:      CUDA0 buffer size =  7605.34 MiB
time=2024-08-24T22:31:56.723Z level=DEBUG source=server.go:638 msg="model load progress 0.07"
time=2024-08-24T22:31:57.977Z level=DEBUG source=server.go:638 msg="model load progress 0.13"
time=2024-08-24T22:31:58.228Z level=DEBUG source=server.go:638 msg="model load progress 0.14"
time=2024-08-24T22:31:58.479Z level=DEBUG source=server.go:638 msg="model load progress 0.15"
time=2024-08-24T22:31:58.730Z level=DEBUG source=server.go:638 msg="model load progress 0.16"
time=2024-08-24T22:31:58.981Z level=DEBUG source=server.go:638 msg="model load progress 0.18"
time=2024-08-24T22:31:59.231Z level=DEBUG source=server.go:638 msg="model load progress 0.19"
time=2024-08-24T22:31:59.482Z level=DEBUG source=server.go:638 msg="model load progress 0.21"
time=2024-08-24T22:31:59.733Z level=DEBUG source=server.go:638 msg="model load progress 0.22"
time=2024-08-24T22:31:59.984Z level=DEBUG source=server.go:638 msg="model load progress 0.23"
time=2024-08-24T22:32:00.235Z level=DEBUG source=server.go:638 msg="model load progress 0.24"
time=2024-08-24T22:32:00.485Z level=DEBUG source=server.go:638 msg="model load progress 0.25"
time=2024-08-24T22:32:00.736Z level=DEBUG source=server.go:638 msg="model load progress 0.27"
time=2024-08-24T22:32:00.987Z level=DEBUG source=server.go:638 msg="model load progress 0.28"
time=2024-08-24T22:32:01.238Z level=DEBUG source=server.go:638 msg="model load progress 0.29"
time=2024-08-24T22:32:01.489Z level=DEBUG source=server.go:638 msg="model load progress 0.30"
time=2024-08-24T22:32:01.740Z level=DEBUG source=server.go:638 msg="model load progress 0.31"
time=2024-08-24T22:32:01.991Z level=DEBUG source=server.go:638 msg="model load progress 0.33"
time=2024-08-24T22:32:02.492Z level=DEBUG source=server.go:638 msg="model load progress 0.35"
time=2024-08-24T22:32:02.743Z level=DEBUG source=server.go:638 msg="model load progress 0.36"
time=2024-08-24T22:32:02.994Z level=DEBUG source=server.go:638 msg="model load progress 0.37"
time=2024-08-24T22:32:03.244Z level=DEBUG source=server.go:638 msg="model load progress 0.38"
time=2024-08-24T22:32:03.495Z level=DEBUG source=server.go:638 msg="model load progress 0.39"
time=2024-08-24T22:32:03.746Z level=DEBUG source=server.go:638 msg="model load progress 0.40"
time=2024-08-24T22:32:03.997Z level=DEBUG source=server.go:638 msg="model load progress 0.41"
time=2024-08-24T22:32:04.248Z level=DEBUG source=server.go:638 msg="model load progress 0.42"
time=2024-08-24T22:32:04.498Z level=DEBUG source=server.go:638 msg="model load progress 0.43"
time=2024-08-24T22:32:04.749Z level=DEBUG source=server.go:638 msg="model load progress 0.44"
time=2024-08-24T22:32:05.000Z level=DEBUG source=server.go:638 msg="model load progress 0.45"
time=2024-08-24T22:32:05.251Z level=DEBUG source=server.go:638 msg="model load progress 0.46"
time=2024-08-24T22:32:05.501Z level=DEBUG source=server.go:638 msg="model load progress 0.47"
[GIN] 2024/08/24 - 22:32:05 | 200 |  214.837914ms |      172.19.0.1 | GET      "/api/tags"
time=2024-08-24T22:32:05.753Z level=DEBUG source=server.go:638 msg="model load progress 0.49"
time=2024-08-24T22:32:06.003Z level=DEBUG source=server.go:638 msg="model load progress 0.50"
time=2024-08-24T22:32:06.505Z level=DEBUG source=server.go:638 msg="model load progress 0.52"
time=2024-08-24T22:32:07.007Z level=DEBUG source=server.go:638 msg="model load progress 0.54"
time=2024-08-24T22:32:07.508Z level=DEBUG source=server.go:638 msg="model load progress 0.56"
time=2024-08-24T22:32:07.759Z level=DEBUG source=server.go:638 msg="model load progress 0.57"
time=2024-08-24T22:32:08.010Z level=DEBUG source=server.go:638 msg="model load progress 0.58"
time=2024-08-24T22:32:08.261Z level=DEBUG source=server.go:638 msg="model load progress 0.59"
time=2024-08-24T22:32:08.511Z level=DEBUG source=server.go:638 msg="model load progress 0.60"
time=2024-08-24T22:32:08.762Z level=DEBUG source=server.go:638 msg="model load progress 0.61"
time=2024-08-24T22:32:09.013Z level=DEBUG source=server.go:638 msg="model load progress 0.62"
time=2024-08-24T22:32:09.515Z level=DEBUG source=server.go:638 msg="model load progress 0.64"
time=2024-08-24T22:32:09.765Z level=DEBUG source=server.go:638 msg="model load progress 0.65"
time=2024-08-24T22:32:10.016Z level=DEBUG source=server.go:638 msg="model load progress 0.66"
time=2024-08-24T22:32:10.267Z level=DEBUG source=server.go:638 msg="model load progress 0.67"
time=2024-08-24T22:32:10.518Z level=DEBUG source=server.go:638 msg="model load progress 0.68"
time=2024-08-24T22:32:10.769Z level=DEBUG source=server.go:638 msg="model load progress 0.69"
time=2024-08-24T22:32:11.019Z level=DEBUG source=server.go:638 msg="model load progress 0.71"
time=2024-08-24T22:32:11.521Z level=DEBUG source=server.go:638 msg="model load progress 0.73"
time=2024-08-24T22:32:11.771Z level=DEBUG source=server.go:638 msg="model load progress 0.74"
time=2024-08-24T22:32:12.022Z level=DEBUG source=server.go:638 msg="model load progress 0.75"
time=2024-08-24T22:32:12.273Z level=DEBUG source=server.go:638 msg="model load progress 0.76"
time=2024-08-24T22:32:12.524Z level=DEBUG source=server.go:638 msg="model load progress 0.78"
time=2024-08-24T22:32:12.775Z level=DEBUG source=server.go:638 msg="model load progress 0.79"
time=2024-08-24T22:32:13.025Z level=DEBUG source=server.go:638 msg="model load progress 0.80"
time=2024-08-24T22:32:13.276Z level=DEBUG source=server.go:638 msg="model load progress 0.81"
time=2024-08-24T22:32:13.527Z level=DEBUG source=server.go:638 msg="model load progress 0.82"
time=2024-08-24T22:32:13.778Z level=DEBUG source=server.go:638 msg="model load progress 0.83"
time=2024-08-24T22:32:14.028Z level=DEBUG source=server.go:638 msg="model load progress 0.84"
time=2024-08-24T22:32:14.279Z level=DEBUG source=server.go:638 msg="model load progress 0.86"
time=2024-08-24T22:32:14.529Z level=DEBUG source=server.go:638 msg="model load progress 0.87"
time=2024-08-24T22:32:14.779Z level=DEBUG source=server.go:638 msg="model load progress 0.88"
time=2024-08-24T22:32:15.030Z level=DEBUG source=server.go:638 msg="model load progress 0.89"
time=2024-08-24T22:32:15.281Z level=DEBUG source=server.go:638 msg="model load progress 0.90"
time=2024-08-24T22:32:15.531Z level=DEBUG source=server.go:638 msg="model load progress 0.91"
time=2024-08-24T22:32:15.782Z level=DEBUG source=server.go:638 msg="model load progress 0.92"
time=2024-08-24T22:32:16.033Z level=DEBUG source=server.go:638 msg="model load progress 0.94"
time=2024-08-24T22:32:16.284Z level=DEBUG source=server.go:638 msg="model load progress 0.95"
time=2024-08-24T22:32:16.534Z level=DEBUG source=server.go:638 msg="model load progress 0.96"
time=2024-08-24T22:32:16.785Z level=DEBUG source=server.go:638 msg="model load progress 0.98"
time=2024-08-24T22:32:17.036Z level=DEBUG source=server.go:638 msg="model load progress 0.99"
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
time=2024-08-24T22:32:17.287Z level=DEBUG source=server.go:638 msg="model load progress 1.00"
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
time=2024-08-24T22:32:17.538Z level=DEBUG source=server.go:641 msg="model load completed, waiting for server to become available" status="llm server loading model"
DEBUG [initialize] initializing slots | n_slots=4 tid="140093911531520" timestamp=1724538738
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="140093911531520" timestamp=1724538738
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=1 tid="140093911531520" timestamp=1724538738
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=2 tid="140093911531520" timestamp=1724538738
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=3 tid="140093911531520" timestamp=1724538738
INFO [main] model loaded | tid="140093911531520" timestamp=1724538738
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="140093911531520" timestamp=1724538738
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=0 tid="140093911531520" timestamp=1724538738
time=2024-08-24T22:32:18.290Z level=INFO source=server.go:632 msg="llama runner started in 24.08 seconds"
time=2024-08-24T22:32:18.290Z level=DEBUG source=sched.go:458 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d
time=2024-08-24T22:32:18.291Z level=DEBUG source=routes.go:1363 msg="chat request" images=0 prompt="<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the color of our sky?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1 tid="140093911531520" timestamp=1724538738
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=2 tid="140093911531520" timestamp=1724538738
DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=19 slot_id=0 task_id=2 tid="140093911531520" timestamp=1724538738
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=2 tid="140093911531520" timestamp=1724538738
DEBUG [print_timings] prompt eval time     =      44.27 ms /    19 tokens (    2.33 ms per token,   429.15 tokens per second) | n_prompt_tokens_processed=19 n_tokens_second=429.1457740434567 slot_id=0 t_prompt_processing=44.274 t_token=2.3302105263157897 task_id=2 tid="140093911531520" timestamp=1724538744
DEBUG [print_timings] generation eval time =    5633.89 ms /   171 runs   (   32.95 ms per token,    30.35 tokens per second) | n_decoded=171 n_tokens_second=30.352035689734034 slot_id=0 t_token=32.94671929824562 t_token_generation=5633.889 task_id=2 tid="140093911531520" timestamp=1724538744
DEBUG [print_timings]           total time =    5678.16 ms | slot_id=0 t_prompt_processing=44.274 t_token_generation=5633.889 t_total=5678.1630000000005 task_id=2 tid="140093911531520" timestamp=1724538744
DEBUG [update_slots] slot released | n_cache_tokens=190 n_ctx=8192 n_past=189 n_system_tokens=0 slot_id=0 task_id=2 tid="140093911531520" timestamp=1724538744 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=35392 status=200 tid="140093251584000" timestamp=1724538744
[GIN] 2024/08/24 - 22:32:24 | 200 | 30.171645765s |      172.19.0.1 | POST     "/api/chat"
time=2024-08-24T22:32:24.013Z level=DEBUG source=sched.go:462 msg="context for request finished"
time=2024-08-24T22:32:24.013Z level=DEBUG source=sched.go:334 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d duration=2562047h47m16.854775807s
time=2024-08-24T22:32:24.013Z level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d refCount=0

The two GPUs have been detected:

time=2024-08-24T22:31:39.601Z level=INFO source=types.go:105 msg="inference compute" id=GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898 library=cuda compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"
time=2024-08-24T22:31:39.601Z level=INFO source=types.go:105 msg="inference compute" id=GPU-b18a7f7a-1c7f-15b3-064d-6596d1a63015 library=cuda compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"

I passed the request with "main_gpu": 1, which should select GPU-b18a7f7a-1c7f-15b3-064d-6596d1a63015. But then the scheduler reports:

time=2024-08-24T22:31:54.211Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2024-08-24T22:31:54.211Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-cc04e85e1f866a5ba87dd66b5260f0cb32354e2c66505e86a7ac3c0092272b7d gpu=GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898 parallel=4 available=11793334272 required="9.7 GiB"
time=2024-08-24T22:31:54.211Z level=DEBUG source=server.go:101 msg="system memory" total="31.3 GiB" free="26.1 GiB" free_swap="16.0 GiB"
time=2024-08-24T22:31:54.211Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[11.0 GiB]"
time=2024-08-24T22:31:54.211Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[11.0 GiB]" memory.required.full="9.7 GiB" memory.required.partial="9.7 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[9.7 GiB]" memory.weights.total="7.9 GiB" memory.weights.repeating="7.4 GiB" memory.weights.nonrepeating="532.3 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"

This means it is loading the weights onto GPU-d1fa0558-2f75-4c4b-0720-8d90d6b01898 (GPU 0) instead of GPU 1.
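
A quick way to double-check where the weights actually landed is per-GPU memory usage, e.g.:

# the ~9.7 GiB of model weights should show up under exactly one device
nvidia-smi --query-gpu=index,uuid,memory.used --format=csv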

dhiltgen commented 1 month ago

The scheduler doesn't currently interpret the main_gpu setting for GPU selection. You can filter GPUs via the GPU vendor-specific environment variables.
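
For example, using the variable names that appear in the server config logged above (the device values here are illustrative):

# NVIDIA
CUDA_VISIBLE_DEVICES=1 ollama serve
# AMD
ROCR_VISIBLE_DEVICES=1 ollama serve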