lmstudio-ai / lmstudio-bug-tracker

Bug tracking for the LM Studio desktop application

v0.3.2 High CPU (no GPU offload?) #102

Open nPHYN1T3 opened 2 months ago

nPHYN1T3 commented 2 months ago

I'm noticing with v0.3.2 that my CPU is getting slaughtered. The UI revamp is worse than the previous iteration, with GPU offload now hidden on the "My Models" page, but even with all the layers assigned to the GPU, my CPU is doing all the work.

*record scratch* I just went and fiddled with things again, and now it looks like the load is spread across all GPUs and the CPU, which suggests the settings are not "taking" correctly. On top of that, CPU use is high with LM Studio just open and idle, compared to 0.2.8.

While I was once again sliding the offload to 0 and then to max just to test, the model also seemed to lose everything from today. When I asked a new question to see where the load went, it said there was a miscommunication and tried to answer something from about a week ago. What? Magic persistence?!

It's starting to look like I should just come back to this in a few years...

mylesgoose commented 2 months ago

I am having the same problem. If I load a model onto the GPU in Windows 11 Pro with an AMD GPU, it runs only on the CPU even though it loads the model onto the GPUs. In the llama.cpp log it says the GGUF runtime is CUDA llama.cpp v1.1.7:

```
{ "result": { "code": "Success", "message": "" }, "gpuInfo": [
{ "name": "NVIDIA GeForce RTX 4090", "deviceId": 0, "totalMemoryCapacityBytes": 25756696576, "dedicatedMemoryCapacityBytes": 0, "integrationType": "Discrete", "detectionPlatform": "CUDA", "detectionPlatformVersion": "", "otherInfo": {} },
{ "name": "NVIDIA GeForce RTX 4090", "deviceId": 1, "totalMemoryCapacityBytes": 25756696576, "dedicatedMemoryCapacityBytes": 0, "integrationType": "Discrete", "detectionPlatform": "CUDA", "detectionPlatformVersion": "", "otherInfo": {} },
{ "name": "NVIDIA GeForce RTX 4090", "deviceId": 2, "totalMemoryCapacityBytes": 25756696576, "dedicatedMemoryCapacityBytes": 0, "integrationType": "Discrete", "detectionPlatform": "CUDA", "detectionPlatformVersion": "", "otherInfo": {} },
{ "name": "NVIDIA GeForce RTX 4090", "deviceId": 3, "totalMemoryCapacityBytes": 25756696576, "dedicatedMemoryCapacityBytes": 0, "integrationType": "Discrete", "detectionPlatform": "CUDA", "detectionPlatformVersion": "", "otherInfo": {} },
{ "name": "NVIDIA GeForce RTX 4090", "deviceId": 4, "totalMemoryCapacityBytes": 25756696576, "dedicatedMemoryCapacityBytes": 0, "integrationType": "Discrete", "detectionPlatform": "CUDA", "detectionPlatformVersion": "", "otherInfo": {} },
{ "name": "NVIDIA GeForce RTX 4090", "deviceId": 5, "totalMemoryCapacityBytes": 25756696576, "dedicatedMemoryCapacityBytes": 0, "integrationType": "Discrete", "detectionPlatform": "CUDA", "detectionPlatformVersion": "", "otherInfo": {} }
] }
{ "result": { "code": "Success", "message": "" }, "cpuInfo": { "architecture": "x86_64", "supportedInstructionSetExtensions": [ "AVX", "AVX2" ] } }
[Client=LM Studio] Client created.
2024-09-03 19:08:45 [DEBUG] AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
2024-09-03 19:08:45 [DEBUG] llama_model_loader: loaded meta data with 32 key-value pairs and 292 tensors from C:\Users\diann.cache\lm-studio\models\mlabonne\Meta-Llama-3.1-8B-Instruct-abliterated-GGUF\meta-llama-3.1-8b-instruct-abliterated.bf16.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct Abliterated llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1 llama_model_loader: - kv 5: general.size_label str = 8B llama_model_loader: - kv 6: general.license str = llama3.1 llama_model_loader: - kv 7: general.base_model.count u32 = 1 llama_model_loader: - kv 8: general.base_model.0.name str = Meta Llama 3.1 8B Instruct llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Met...
llama_model_loader: - kv 11: general.tags arr[str,2] = ["abliterated", "uncensored"] llama_model_loader: - kv 12: llama.block_count u32 = 32 llama_model_loader: - kv 13: llama.context_length u32 = 131072 llama_model_loader: - kv 14: llama.embedding_length u32 = 4096 llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 16: llama.attention.head_count u32 = 32 llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 18: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 20: general.file_type u32 = 32 llama_model_loader: - kv 21: llama.vocab_size u32 = 128256 llama_model_loader: - kv 22: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 24: tokenizer.ggml.pre str = llama-bpe 2024-09-03 19:08:45 [DEBUG] llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... 2024-09-03 19:08:45 [DEBUG] llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 2024-09-03 19:08:46 [DEBUG] llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 30: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... llama_model_loader: - kv 31: general.quantization_version u32 = 2 llama_model_loader: - type f32: 66 tensors llama_model_loader: - type bf16: 226 tensors 2024-09-03 19:08:46 [DEBUG] llm_load_vocab: special tokens cache size = 256 2024-09-03 19:08:46 [DEBUG] llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_layer = 32 2024-09-03 19:08:46 [DEBUG] llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = BF16 llm_load_print_meta: model params = 8.03 
B llm_load_print_meta: model size = 14.96 GiB (16.00 BPW) llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct Abliterated llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 2024-09-03 19:08:46 [DEBUG] ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 6 CUDA devices: 2024-09-03 19:08:46 [DEBUG] Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 3: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 4: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 5: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes 2024-09-03 19:08:48 [DEBUG] llm_load_tensors: ggml ctx size = 0.96 MiB 2024-09-03 19:08:51 [DEBUG] llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 1002.00 MiB llm_load_tensors: CUDA0 buffer size = 2496.19 MiB llm_load_tensors: CUDA1 buffer size = 2080.16 MiB llm_load_tensors: CUDA2 buffer size = 2496.19 MiB llm_load_tensors: CUDA3 buffer size = 2080.16 MiB llm_load_tensors: CUDA4 buffer size = 2496.19 MiB llm_load_tensors: CUDA5 buffer size = 2666.14 MiB 2024-09-03 19:08:56 [DEBUG] llama_new_context_with_model: n_ctx = 97408 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 2024-09-03 19:08:56 [DEBUG] llama_kv_cache_init: CUDA0 KV buffer size = 2283.00 MiB 2024-09-03 19:08:56 [DEBUG] llama_kv_cache_init: CUDA1 KV buffer size = 1902.50 MiB 2024-09-03 19:08:57 [DEBUG] llama_kv_cache_init: CUDA2 KV buffer size = 2283.00 MiB 2024-09-03 19:08:57 [DEBUG] llama_kv_cache_init: CUDA3 KV buffer size = 1902.50 MiB 2024-09-03 19:08:57 [DEBUG] llama_kv_cache_init: CUDA4 KV buffer size = 2283.00 MiB 2024-09-03 19:08:57 [DEBUG] llama_kv_cache_init: CUDA5 KV buffer size = 1522.00 MiB llama_new_context_with_model: KV self size = 12176.00 MiB, K (f16): 6088.00 MiB, V (f16): 6088.00 MiB 2024-09-03 19:08:57 [DEBUG] llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB 2024-09-03 19:08:57 [DEBUG] llama_new_context_with_model: pipeline parallelism enabled (n_copies=4) 2024-09-03 19:08:57 [DEBUG] ggml_cuda_host_malloc: failed to allocate 64955.52 MiB of pinned memory: out of memory 2024-09-03 19:08:57 [DEBUG] llama_new_context_with_model: CUDA0 compute buffer size = 8913.01 MiB llama_new_context_with_model: CUDA1 compute buffer size = 8577.01 MiB llama_new_context_with_model: CUDA2 compute buffer size = 8913.01 MiB llama_new_context_with_model: CUDA3 compute buffer size = 8577.01 MiB llama_new_context_with_model: CUDA4 compute buffer size = 8913.01 MiB llama_new_context_with_model: CUDA5 compute buffer size = 8241.02 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 64955.52 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 456 2024-09-03 19:09:35 [DEBUG] sampling: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature generate: n_ctx = 97408, n_batch = 512, n_predict = -1, n_keep = 38 2024-09-03 19:10:31 [DEBUG]

llama_print_timings: load time = 18845.67 ms llama_print_timings: sample time = 2.95 ms / 8 runs ( 0.37 ms per token, 2711.86 tokens per second) llama_print_timings: prompt eval time = 13268.44 ms / 11 tokens ( 1206.22 ms per token, 0.83 tokens per second) llama_print_timings: eval time = 43051.32 ms / 7 runs ( 6150.19 ms per token, 0.16 tokens per second) llama_print_timings: total time = 56332.32 ms / 18 tokens 2024-09-03 19:11:42 [DEBUG] AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 2024-09-03 19:11:42 [DEBUG] llama_model_loader: loaded meta data with 32 key-value pairs and 292 tensors from C:\Users\diann.cache\lm-studio\models\mlabonne\Meta-Llama-3.1-8B-Instruct-abliterated-GGUF\meta-llama-3.1-8b-instruct-abliterated.bf16.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct Abliterated llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1 llama_model_loader: - kv 5: general.size_label str = 8B llama_model_loader: - kv 6: general.license str = llama3.1 llama_model_loader: - kv 7: general.base_model.count u32 = 1 llama_model_loader: - kv 8: general.base_model.0.name str = Meta Llama 3.1 8B Instruct llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Met... llama_model_loader: - kv 11: general.tags arr[str,2] = ["abliterated", "uncensored"] llama_model_loader: - kv 12: llama.block_count u32 = 32 llama_model_loader: - kv 13: llama.context_length u32 = 131072 llama_model_loader: - kv 14: llama.embedding_length u32 = 4096 llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 16: llama.attention.head_count u32 = 32 llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 18: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 20: general.file_type u32 = 32 llama_model_loader: - kv 21: llama.vocab_size u32 = 128256 2024-09-03 19:11:42 [DEBUG] llama_model_loader: - kv 22: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 24: tokenizer.ggml.pre str = llama-bpe 2024-09-03 19:11:42 [DEBUG] llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... 2024-09-03 19:11:42 [DEBUG] llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 2024-09-03 19:11:42 [DEBUG] llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 30: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... 
llama_model_loader: - kv 31: general.quantization_version u32 = 2 llama_model_loader: - type f32: 66 tensors llama_model_loader: - type bf16: 226 tensors 2024-09-03 19:11:42 [DEBUG] llm_load_vocab: special tokens cache size = 256 2024-09-03 19:11:42 [DEBUG] llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 2024-09-03 19:11:42 [DEBUG] llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = BF16 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 14.96 GiB (16.00 BPW) llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct Abliterated llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 2024-09-03 19:11:43 [DEBUG] ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 6 CUDA devices: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 3: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 4: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 5: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes 2024-09-03 19:11:44 [DEBUG] llm_load_tensors: ggml ctx size = 0.96 MiB 2024-09-03 19:11:46 [DEBUG] llm_load_tensors: offloading 31 repeating layers to GPU llm_load_tensors: offloaded 31/33 layers to GPU llm_load_tensors: CPU buffer size = 15317.02 MiB llm_load_tensors: CUDA0 buffer size = 2496.19 MiB llm_load_tensors: CUDA1 buffer size = 2080.16 MiB llm_load_tensors: CUDA2 buffer size = 2080.16 MiB llm_load_tensors: CUDA3 buffer size = 2080.16 MiB llm_load_tensors: CUDA4 buffer size = 2080.16 MiB llm_load_tensors: CUDA5 buffer size = 2080.16 MiB 2024-09-03 19:11:59 [DEBUG] llama_new_context_with_model: n_ctx = 97408 
llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 2024-09-03 19:11:59 [DEBUG] llama_kv_cache_init: CUDA_Host KV buffer size = 380.50 MiB 2024-09-03 19:11:59 [DEBUG] llama_kv_cache_init: CUDA0 KV buffer size = 2283.00 MiB 2024-09-03 19:11:59 [DEBUG] llama_kv_cache_init: CUDA1 KV buffer size = 1902.50 MiB 2024-09-03 19:11:59 [DEBUG] llama_kv_cache_init: CUDA2 KV buffer size = 1902.50 MiB 2024-09-03 19:11:59 [DEBUG] llama_kv_cache_init: CUDA3 KV buffer size = 1902.50 MiB 2024-09-03 19:11:59 [DEBUG] llama_kv_cache_init: CUDA4 KV buffer size = 1902.50 MiB 2024-09-03 19:11:59 [DEBUG] llama_kv_cache_init: CUDA5 KV buffer size = 1902.50 MiB llama_new_context_with_model: KV self size = 12176.00 MiB, K (f16): 6088.00 MiB, V (f16): 6088.00 MiB 2024-09-03 19:11:59 [DEBUG] llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB 2024-09-03 19:12:01 [DEBUG] llama_new_context_with_model: CUDA0 compute buffer size = 6298.25 MiB llama_new_context_with_model: CUDA1 compute buffer size = 6298.25 MiB llama_new_context_with_model: CUDA2 compute buffer size = 6298.25 MiB llama_new_context_with_model: CUDA3 compute buffer size = 6298.25 MiB llama_new_context_with_model: CUDA4 compute buffer size = 6298.25 MiB llama_new_context_with_model: CUDA5 compute buffer size = 6298.25 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 6294.26 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 451 2024-09-03 19:18:50 [DEBUG] sampling: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature generate: n_ctx = 97408, n_batch = 512, n_predict = -1, n_keep = 38
```

You: hello

Assistant (Meta-Llama-3.1-8B-Instruct-abliterated-GGUF): Hello again! Is there something I can help you with, or would you like to chat? I'm here to listen and assist in any way I can.

0.34 tok/sec

32 tokens

18.02s to first token

Stop: eosFound

So with 6 RTX 4090s and an 8B model it is running at 0.34 tokens per second? It should be well over 100 tok/s. It works perfectly fine if I boot into Linux and try the LM Studio Linux version on the same CPU and GPU combo.
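For what it's worth, the load logs above already contain the key clue: the first load reports `offloaded 33/33 layers to GPU` with a `CPU buffer size` of 1002 MiB, while the second load reports only `offloaded 31/33 layers to GPU` with a `CPU buffer size` of 15317 MiB, i.e. the bulk of the weights stayed in system RAM. If anyone else wants to dig through their own logs, here is a minimal sketch that pulls those indicators out of a saved log file (the script and log file names are hypothetical; save the developer log pane to a file first):

```python
import re
import sys

# Pull GPU-offload indicators out of a saved llama.cpp load log.
# Usage: python check_offload.py saved-load.log   (both names are placeholders)
OFFLOAD = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")
BUFFER = re.compile(r"(CPU|CUDA\d+) buffer size\s*=\s*([\d.]+) MiB")

def summarize(path: str) -> None:
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if m := OFFLOAD.search(line):
                print(f"offloaded {m.group(1)}/{m.group(2)} layers to GPU")
            if m := BUFFER.search(line):
                print(f"{m.group(1):>6} weight buffer: {float(m.group(2)):8.0f} MiB")

if __name__ == "__main__":
    summarize(sys.argv[1])
```

If the CPU weight buffer is on the order of the model size rather than a few hundred MiB, the weights are not actually on the GPUs, regardless of what the offload slider says.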

smirgol commented 1 month ago

I'm not sure if this helps, but I had a similar problem where GPU acceleration just wasn't available anymore on my AMD RX 7900 XTX. I finally found out why: by default the CPU runtime is enabled, and it will not even show any GPU info, as if none exists. That can be changed in a rather hidden setting:

  1. You need to switch the GUI to "Developer" mode - on the bottom left of the GUI there should be three buttons, one being "Developer".
  2. Then on the left there should be a new green icon with tooltip "Developer". Click that.
  3. In the main window at the top, to the right of "LM Studio", there should be two buttons: "Local Server" and "LM Runtimes". Click "LM Runtimes".
  4. Now, on the right side there is "Runtime Preferences" and below that "Configure Runtimes".
  5. Switch the runtime for "GGUF" from "CPU llama.cpp" to "Vulkan llama.cpp".
  6. There you go.

Maybe that helps someone. Cheers!
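After switching the runtime, a quick throughput check makes it obvious whether GPU inference is actually back. This is only a rough sketch against LM Studio's OpenAI-compatible local server: it assumes the server is running on its default port 1234, and the model identifier is a placeholder you would swap for whatever `/v1/models` reports on your machine.

```python
import time
import requests

# Time a single chat completion against the LM Studio local server and report
# a rough tokens/sec figure for the generated completion.
BASE_URL = "http://localhost:1234/v1"          # default local server address (assumption)
MODEL = "meta-llama-3.1-8b-instruct"           # placeholder model identifier

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Write three sentences about GPUs."}],
    "max_tokens": 128,
    "temperature": 0.8,
}

start = time.time()
resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
elapsed = time.time() - start

usage = resp.json().get("usage", {})
completion_tokens = usage.get("completion_tokens", 0)
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"({completion_tokens / elapsed:.1f} tok/s)")
```

If the result is still in the sub-1 tok/s range reported earlier in this thread for an 8B model on 4090s, the runtime switch has not taken effect.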

nPHYN1T3 commented 1 month ago

I already outlined the moved/hidden GPU settings in the OP. Even using the new GPU offload UI, the workload is still dumped on the CPU. I've ditched LM Studio (0.3.x is terribly broken) and just gone back to ollama.

smirgol commented 1 month ago

True, but note that the GPU Offload setting on the My Models tab won't do anything if LM Studio is running the CPU-only llama.cpp runtime, which was my issue and which can be fixed as described above.
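Another way to tell whether the offload slider is having any effect at all (on NVIDIA machines; AMD users would need `rocm-smi` or similar) is to watch VRAM while the model loads. A minimal sketch, assuming `nvidia-smi` is on the PATH:

```python
import subprocess
import time

# Poll per-GPU VRAM usage once a second. If GPU offload is working, memory.used
# should jump by roughly the model size on one or more GPUs during model load.
def vram_used_mib() -> list[int]:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(x) for x in out.split()]

baseline = vram_used_mib()
print("baseline MiB per GPU:", baseline)
for _ in range(30):                      # watch for ~30 seconds while the model loads
    time.sleep(1)
    current = vram_used_mib()
    deltas = [c - b for c, b in zip(current, baseline)]
    print("delta MiB per GPU:  ", deltas)
```

If the deltas stay near zero while the load completes, the weights went to system RAM and the CPU-only runtime is still in charge.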

aqasem81 commented 1 month ago

Same issue here. I'm running this on Windows 11 with an Intel Arc iGPU and an Nvidia 4060; the load is on the CPU by default, and when I tried the GPU offload setting nothing changed.