Open tempstudio opened 1 month ago
@tempstudio could you check if the issue remains with the latest release (v2.2.0)?
I see the same issue with 2.2.1
(Filename: ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs Line: 137)
INFO [ init] build info | tid="27560" timestamp=1725497899 build=3623 commit="436787f1"
INFO [ init] system info | tid="27560" timestamp=1725497899 n_threads=12 n_threads_batch=-1 total_threads=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from E:/.../Assets/StreamingAssets/mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
Loaded scene 'Temp/__Backupscenes/0.backup'
Deserialize: 5.726 ms
Integration: 341.064 ms
Integration of assets: 0.002 ms
Thread Wait Time: 0.004 ms
Total Operation Time: 346.796 ms
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens cache size = 3
INFO [ init] build info | tid="27560" timestamp=1725497899 build=3623 commit="436787f1"
INFO [ init] system info | tid="27560" timestamp=1725497899 n_threads=12 n_threads_batch=-1 total_threads=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
UnityEngine.StackTraceUtility:ExtractStackTrace ()
UnityEngine.DebugLogHandler:LogFormat (UnityEngine.LogType,UnityEngine.Object,string,object[])
UnityEngine.Logger:Log (UnityEngine.LogType,object)
UnityEngine.Debug:LogWarning (object)
LLMUnity.LLMUnitySetup:LogWarning (string) (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs:143)
LLMUnity.StreamWrapper:Update () (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMLib.cs:66)
LLMUnity.LLM:Update () (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLM.cs:483)
(Filename: ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs Line: 143)
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 '
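As a sanity check on the metadata above, the reported bits per weight (BPW) can be approximately reproduced from the reported model size and parameter count. This is a minimal sketch, assuming the size is in GiB (2^30 bytes) and the parameter count is in units of 10^9; `bits_per_weight` is a hypothetical helper, and because the inputs are already rounded in the log, the result is only approximate.

```python
def bits_per_weight(size_gib, params_billion):
    """Approximate bits per weight from a model size in GiB and a parameter count in billions."""
    total_bits = size_gib * 2**30 * 8
    return total_bits / (params_billion * 1e9)


# Values taken from the log: model size = 4.07 GiB, model params = 7.24 B.
bpw = bits_per_weight(4.07, 7.24)
print(f"{bpw:.2f} BPW")  # 4.83 BPW
```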
Thank you for testing! I can't implement support for this card myself because that is down to llama.cpp. However, I'll see if I can wrap the error so that Unity doesn't crash and you can use the GPU with Vulkan. I'll send you a build to try later 🙏
Could you try the new build by changing the LlamaLib version here from v1.1.10 to v1.1.10-dev?
You will also need to delete the undreamai-v1.1.10-llamacpp folder from Assets/StreamingAssets.
With this build it should skip the HIP build and use the Vulkan instead 🤞
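The cleanup step can also be scripted. The following is a minimal sketch, assuming a standard Unity project layout; `remove_llamalib_cache` is a hypothetical helper and not part of LLMUnity, and the default folder name is the one mentioned in this thread.

```python
import shutil
import tempfile
from pathlib import Path


def remove_llamalib_cache(project_root, folder_name="undreamai-v1.1.10-llamacpp"):
    """Delete the cached LlamaLib folder under Assets/StreamingAssets.

    Returns True if a folder was removed, False if there was nothing to remove.
    """
    target = Path(project_root) / "Assets" / "StreamingAssets" / folder_name
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than on a real project.
    with tempfile.TemporaryDirectory() as root:
        cache = Path(root) / "Assets" / "StreamingAssets" / "undreamai-v1.1.10-llamacpp"
        cache.mkdir(parents=True)
        print(remove_llamalib_cache(root))  # True: the folder existed and was deleted
        print(remove_llamalib_cache(root))  # False: nothing left to delete
```

Close the Unity editor before running something like this on a real project, so the native libraries are not in use while they are deleted.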
Apologies: I was using the wrong binaries yesterday, so even though the C# code was 2.2.1, the native code in StreamingAssets was probably still the old version. I deleted the "StreamingAssets" directory and tried again.
After I deleted things from StreamingAssets and reinstalled the package, it didn't crash this time. But I'm pretty sure it's using the CPU: it is very slow and CPU usage is high.
Server command: -m "C:/Users/.../AppData/Roaming/LLMUnity/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" -c 4096 -b 512 --log-disable -np 1 -ngl -1
UnityEngine.StackTraceUtility:ExtractStackTrace ()
UnityEngine.DebugLogHandler:LogFormat (UnityEngine.LogType,UnityEngine.Object,string,object[])
UnityEngine.Logger:Log (UnityEngine.LogType,object)
UnityEngine.Debug:Log (object)
LLMUnity.LLMUnitySetup:Log (string) (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs:137)
LLMUnity.LLM:StartLLMServer (string) (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLM.cs:373)
LLMUnity.LLM/<>c__DisplayClass45_0:
(Filename: ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs Line: 137)
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
...
llm_load_tensors: CPU buffer size = 4685.30 MiB
........................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.98 MiB
llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
Giving the 1.1.10-dev a try now
The behavior is the same with 1.1.10-dev.
You are using num GPU layers -1 which will not use the GPU. Could you try e.g. with 10? There should be debug messages that start with "Tried architecture", can you post those as well?
I thought -1 would mean all / max? With 9999 GPU Layers it crashed with the same error even on 1.1.10-dev :/ I think it's been the same issue.
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4403.50 MiB
llm_load_tensors: CPU buffer size = 281.81 MiB
........................................Asset Pipeline Refresh (id=2020b226d14d319468ddb810101aa4ca): Total: 0.008 seconds - Initiated by RefreshV2(NoUpdateAssetOptions)
...............................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.98 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 258.50 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 903
llama_new_context_with_model: graph splits = 2
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:16369
err
D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:14155: CUDA error
Asset Pipeline Refresh (id=7f5d46cd6ec704f4ba373546e19f8732): Total: 0.006 seconds - Initiated by RefreshV2(NoUpdateAssetOptions)
Tried it with flash attention OFF and it's the same:
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4403.50 MiB
llm_load_tensors: CPU buffer size = 281.81 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.98 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 296.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:16369
err
D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:14155: CUDA error
Unable to find style 'TemplatesPromo' in skin 'DarkSkin' Layout
Thanks a lot!
Could you do one more test with v1.1.10-dev2?
A couple of problems I encountered with 1.1.10-dev2: First, the install didn't work; it just installed an empty folder. I manually downloaded the entire zip and unzipped it into the StreamingAssets folder. Second, after that, the same error happened. Third, I deleted the two "windows-cuda" folders from the directory, and it crashed again. Finally, I deleted the "windows-hip" folder from the directory; it doesn't crash anymore, but it doesn't use the GPU. It seems it's not even going to try Vulkan.
Thanks a lot. I have fixed the issue with the empty folder in v2.2.2. It seems I can't do much at the moment for the specific GPU unfortunately. I'll keep an eye on the llama.cpp updates and let you know once I find a solution.
I'm going through some issues and I have an idea. I may have to specify your GPU architecture in the HIP build.
Could you try the v1.1.11 build?
I have explicitly set the AMD architectures, including the one for your GPU (gfx1030).
The good news is that it doesn't crash anymore. The bad news is that the performance is much worse than CPU only. Running the chat pegs GPU usage at 100% and it stutters. It also took an extremely long time to generate anything. I recall running it with llamafile, and it was at least 20x faster than this (this is with only 1 layer on the GPU; using all layers makes the OS unresponsive):
INFO [ print_timings] prompt eval time = 192189.92 ms / 399 tokens ( 481.68 ms per token, 2.08 tokens per second) | tid="5292" timestamp=1725926634 id_slot=0 id_task=1 t_prompt_processing=192189.92 n_prompt_tokens_processed=399 t_token=481.67899749373436 n_tokens_second=2.0760714193543555
INFO [ print_timings] generation eval time = 24258.31 ms / 41 runs ( 591.67 ms per token, 1.69 tokens per second) | tid="5292" timestamp=1725926634 id_slot=0 id_task=1 t_token_generation=24258.305 n_decoded=41 t_token=591.6659756097561 n_tokens_second=1.6901428191293661
INFO [ print_timings] total time = 216448.23 ms | tid="5292" timestamp=1725926634 id_slot=0 id_task=1 t_prompt_processing=192189.92 t_token_generation=24258.305 t_total=216448.225
INFO [ update_slots] slot released | tid="5292" timestamp=1725926634 id_slot=0 id_task=1 n_ctx=2048 n_past=439 n_system_tokens=0 n_cache_tokens=439 truncated=false
INFO [ update_slots] all slots are idle | tid="5292" timestamp=1725926634
INFO [ update_slots] all slots are idle | tid="5292" timestamp=1725926634
I have updated to the latest drivers and also just restarted my system.
Yes! That works! What happens if you use more layers but not extreme ones e.g. 10, 25, 50?
Performance is equally bad with 10/30 layers.
10 layers: prompt processing 2 tk/s, generation 1 tk/s
30 layers: prompt processing 2 tk/s, generation 0.5 tk/s
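For reference, the tokens-per-second figures in the `print_timings` output for the 1-layer run follow directly from the elapsed time and the token count. A minimal sketch of that arithmetic (`tokens_per_second` is a hypothetical helper name, not part of any library):

```python
def tokens_per_second(elapsed_ms, n_tokens):
    """Convert an elapsed time in milliseconds and a token count to tokens per second."""
    return n_tokens / (elapsed_ms / 1000.0)


# Values from the 1-layer run: 399 prompt tokens in 192189.92 ms,
# 41 generated tokens in 24258.31 ms.
prompt_tps = tokens_per_second(192189.92, 399)
gen_tps = tokens_per_second(24258.31, 41)
print(f"prompt: {prompt_tps:.2f} tk/s, generation: {gen_tps:.2f} tk/s")
# prints "prompt: 2.08 tk/s, generation: 1.69 tk/s"
```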
Is there any possibility of the performance issue being fixed in llamalib? If not, is it possible to provide a 2.x build that uses llamafile as a backend?
I really doubt it is a problem of LlamaLib because I use and extend code directly from llama.cpp and llamafile.
This is an overview of the different libraries:
llama.cpp
It is the main implementation that all libraries use.
Specifically for GPUs, it uses CUDA (Nvidia) and CUDA+HIP (AMD).
This is the fastest, but including CUDA in the builds increases the build size to 1 GB per build.
To support most Nvidia GPUs I include both CUDA 11 and 12 builds, which would mean 2 GB.

llamafile
It packages and serves llama.cpp in just a single file for all OSes.
Specifically for GPUs, it uses CUDA (Nvidia) and CUDA+HIP (AMD) if the system has CUDA already installed (rare, unless you are into AI).
Otherwise it uses its own tinyBLAS implementation, which has speed lower than or equal to CUDA (from version 0.7 onwards).
The benefit is that it needs less than 100 MB to include in the build.

LlamaLib
It extends llama.cpp with the functionality needed to use it as a Unity / C# library and builds binaries for the different architectures.
I use the llama.cpp implementation, but specifically for GPUs I hack it and use tinyBLAS to keep the build size small.

The source of the speed issue is most probably the tinyBLAS implementation of llamafile. If you have CUDA installed or use llamafile with a version earlier than 0.7, llamafile will still use CUDA, which will give you the speed boost.
There are reasons why I don't use llamafile anymore, although I love the project:
For these reasons I can't bring it back to the project. I'd prefer to find where the source of the problem is and solve it there. It is tricky for me to work with AMD because I don't have one and there is none available on the cloud that is supported.
You could try the following to understand more about the issue using the latest llamafile.
Check the timings for both cases:
llamafile without CUDA
C:/Users/<USER>
)llamafile-0.8.13.exe -m <path_to_model> -ngl 10 -p "to be or" --nocompile --tinyblas
llamafile with CUDA
C:/Users/<USER>
)llamafile-0.8.13.exe -m <path_to_model> -ngl 10 -p "to be or"
Then we could find out which implementation is the culprit.
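The comparison can be automated with a small script. The following is a minimal sketch, assuming the only difference between the two runs is the `--nocompile --tinyblas` pair of flags shown above and that llamafile prints `llama_print_timings:` lines in the usual llama.cpp format. The helper names `llamafile_command`, `parse_timings`, and `run_and_parse` are hypothetical, and the executable and model paths are placeholders.

```python
import re
import subprocess


def llamafile_command(exe, model, force_tinyblas, ngl=10, prompt="to be or"):
    """Build the llamafile command line for one of the two test cases."""
    cmd = [exe, "-m", model, "-ngl", str(ngl), "-p", prompt]
    if force_tinyblas:
        # Flags taken from the "without CUDA" command above.
        cmd += ["--nocompile", "--tinyblas"]
    return cmd


# Matches e.g. "llama_print_timings: eval time = 22152.88 ms / 772 runs
# ( 28.70 ms per token, 34.85 tokens per second)".
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(prompt eval|eval) time\s*=\s*[\d.]+\s*ms\s*/\s*\d+\s+"
    r"(?:tokens|runs)\s*\(\s*[\d.]+\s*ms per token,\s*([\d.]+)\s*tokens per second\)"
)


def parse_timings(log_text):
    """Return tokens/s per stage ('prompt eval', 'eval') found in the log text."""
    return {stage: float(tps) for stage, tps in TIMING_RE.findall(log_text)}


def run_and_parse(cmd):
    """Run llamafile and parse timings from its combined output (stream choice is an assumption)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_timings(result.stdout + result.stderr)


if __name__ == "__main__":
    # Placeholder paths; replace with the real executable and model.
    for tinyblas in (True, False):
        print(llamafile_command("llamafile-0.8.13.exe", "<path_to_model>", tinyblas))
```

Calling `run_and_parse` on each command and comparing the `eval` values would show which backend is slower on a given machine.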
I will give those a try. Can you build llamalib into a command line standalone so that I can test that too, just in case there's something wonky going on with gpu resource sharing between the ai and unity?
Here is the performance with tinyBLAS. I don't believe the CUDA run is needed as I'm using an AMD system and it doesn't support CUDA. I will be very happy if I can get this type of performance inside Unity.
.\llamafile-0.8.13.exe -m .\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 99 -p "to be or" --nocompile --tinyblas -c 2048
llama_print_timings: load time = 2364.86 ms
llama_print_timings: sample time = 55.42 ms / 773 runs ( 0.07 ms per token, 13948.79 tokens per second)
llama_print_timings: prompt eval time = 36.01 ms / 4 tokens ( 9.00 ms per token, 111.08 tokens per second)
llama_print_timings: eval time = 22152.88 ms / 772 runs ( 28.70 ms per token, 34.85 tokens per second)
llama_print_timings: total time = 22420.02 ms / 776 tokens
Log end
More logs that might be helpful:
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++.exe not found on $PATH
get_rocm_bin_path: note: /D/Drivers/ROCM/5.7//bin/amdclang++.exe does not exist
get_rocm_bin_path: note: clang++.exe not found on $PATH
link_cuda_dso: note: dynamically linking /C/Users/Tony/.llamafile/v/0.8.13/ggml-rocm.dll
ggml_cuda_link: welcome to ROCm SDK with tinyBLAS
link_cuda_dso: GPU support loaded
llm_load_print_meta: model size = 4.58 GiB (4.89 BPW)
llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.32 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloaded 32/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 3992.51 MiB
llm_load_tensors: CPU buffer size = 4685.30 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.49 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 669.48 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 4
Another piece of info: during the execution I see that the GPU usage is at 1% instead of 99% that I see when using llamalib in task manager. This might be inaccurate.
Describe the bug
Crash with abort when trying to use AMD graphics card in editor.
Model is mistral-7b-instruct-v0.2.Q4_K_M.gguf
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.30 MiB
d3d12: upload buffer was full! Waited for COPY queue for 1.118 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.902 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.897 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.896 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.901 ms.
[Licensing::Client] Successfully resolved entitlement details
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4095.05 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..............................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.24 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 296.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
[1722650470] warming up the model with an empty run
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at D:/a/LlamaLib/LlamaLib/llama.cpp/ggml-cuda.cu:13061
err
Asset Pipeline Refresh (id=5fe1348313ec9e4439edb8aa2e9d608c): Total: 0.010 seconds - Initiated by RefreshV2(NoUpdateAssetOptions)
Asset Pipeline Refresh (id=a398558039bd1ba4a8f2fc04f6154810): Total: 0.007 seconds - Initiated by RefreshV2(NoUpdateAssetOptions)
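The KV buffer size in this log can be cross-checked with simple arithmetic. This is a minimal sketch, assuming the common estimate for an f16 cache (one K and one V vector per layer per context position, 2 bytes per element); `kv_cache_mib` is a hypothetical helper, and the inputs are the values reported for this model in the thread (n_layer = 32, n_embd_k_gqa = n_embd_v_gqa = 1024, n_ctx = 4096).

```python
def kv_cache_mib(n_ctx, n_layer, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Estimate K and V cache sizes in MiB for a given context length."""
    mib = 1024 * 1024
    k_bytes = n_ctx * n_layer * n_embd_k_gqa * bytes_per_elem
    v_bytes = n_ctx * n_layer * n_embd_v_gqa * bytes_per_elem
    return k_bytes / mib, v_bytes / mib


k_mib, v_mib = kv_cache_mib(n_ctx=4096, n_layer=32, n_embd_k_gqa=1024, n_embd_v_gqa=1024)
print(k_mib, v_mib, k_mib + v_mib)  # 256.0 256.0 512.0
```

The result agrees with the reported "KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB", so the cache itself is small next to the roughly 4 GiB of model weights offloaded to the GPU.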
Steps to reproduce
No response
LLMUnity version
2.0.3
Operating System
Windows