eramax closed this issue 2 months ago

With my NVIDIA 1050 Ti and Ubuntu 23.04 on WSL, the model takes a long time to load (about three minutes), but the response is, I believe, faster than with other tools. After the first question was answered, the program crashed.

This is the second try:
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q5_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 4.45 GiB (5.68 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.10 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 86.04 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4474.93 MB
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 76.63 MB
llama_new_context_with_model: VRAM scratch buffer: 70.50 MB
llama_new_context_with_model: total VRAM used: 4801.43 MB (model: 4474.93 MB, context: 326.50 MB)
Answer:
[2023-11-14 21:35:13.952] [info] [WASI-NN] GGML backend: llama_system_info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
WSL (Windows Subsystem for Linux) is a feature in Windows 10 that allows users to run a full Linux environment directly on top of the Windows operating system. This means that users can run Linux commands, applications, and tools alongside Windows applications, without the need to install a separate Linux operating system. WSL is designed to provide a seamless and efficient way to use Linux on Windows, while still taking advantage of the security and stability of the Windows platform.
llama_print_timings: load time = 129174.03 ms
llama_print_timings: sample time = 3.85 ms / 98 runs ( 0.04 ms per token, 25461.16 tokens per second)
llama_print_timings: prompt eval time = 1295.73 ms / 43 tokens ( 30.13 ms per token, 33.19 tokens per second)
llama_print_timings: eval time = 21818.37 ms / 97 runs ( 224.93 ms per token, 4.45 tokens per second)
llama_print_timings: total time = 151004.54 ms
This time it didn't crash after the first question, but it always unloads the model from the GPU, and when I ask another question it reloads it, which means it takes another 2-3 minutes to load the model. I also noticed that this app doesn't put much load on the CPU, unlike other apps (e.g., Ollama).
Please help fix the load time; I think this app could be a perfect solution.
Best,
The 1050 Ti only has 4GB of VRAM, and the quantized 7B model itself is about 4.5GB. So it barely runs ... As you ask more questions, the context length grows and the machine demands more GPU RAM, which causes the failure.
One thing you could try is tweaking the -ngl CLI argument to reduce the number of layers offloaded to the GPU (say, set it to 10). Inference will be slower, but it can make use of CPU RAM.
@hydai Can you look into the "reloading after each question" problem? I thought we addressed it on our test machines? Thanks.
I tried llama-chat.wasm and it didn't work:
➜ llama-2-7b-chat-Q5_K_M-gguf wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
llama-chat.wasm -c 2048 -n 512 --log-stat
[INFO] Model alias: default
[INFO] Prompt context size: 2048
[INFO] Number of tokens to predict: 512
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 4096
[INFO] Use default system prompt
[INFO] Prompt template: Llama2Chat
[INFO] Stream stdout: false
[INFO] Log prompts: false
[INFO] Log statistics: true
[INFO] Log all information: false
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1
[2023-11-14 21:59:22.222] [error] [WASI-NN] GGML backend: Error: unable to init model.
Error: "Fail to load model into wasi-nn: Backend Error: WASI-NN Backend Error: Caller module passed an invalid argument"
➜ llama-2-7b-chat-Q5_K_M-gguf
The app is really fast, but I have to wait for the loading and reloading every time, which is annoying. I wish the app could utilize both GPU RAM and system RAM (other tools already manage this, and with them I can run 34B models). The reload, I guess, happens when I submit a new question. I already reduced the number of GPU layers, and I can see my GPU memory usage is around 50%, yet it still crashed on the second question while my GPU RAM was being freed.
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 4.45 GiB (5.68 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.10 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 3178.93 MB
llm_load_tensors: offloading 10 repeating layers to GPU
llm_load_tensors: offloaded 10/35 layers to GPU
llm_load_tensors: VRAM used: 1382.04 MB
...................................................................................................
[2023-11-14 22:11:25.780] [error] [WASI-NN] GGML backend: Error: prompt too long (524 tokens, max 508)
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: BackendError(InvalidArgument)', src/main.rs:105:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[2023-11-14 22:11:25.780] [error] execution failed: unreachable, Code: 0x89
[2023-11-14 22:11:25.780] [error] In instruction: unreachable (0x00) , Bytecode offset: 0x000107ad
[2023-11-14 22:11:25.780] [error] When executing function name: "_start"
You could specify fewer GPU layers using the -g flag in llama-chat.wasm. Perhaps try -g 15?
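Concretely, that would look something like the following (a sketch reusing the invocation from earlier in this thread; adjust the model path to yours):

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
  llama-chat.wasm -c 2048 -n 512 -g 15 --log-stat

With -g 15, only 15 of the 35 layers are offloaded to the GPU; the rest run on the CPU from system RAM, trading speed for lower VRAM use.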
Yes, the model reloading is a problem we need to fix. Thank you for reporting this!
wasmedge llama-chat.wasm --help
Usage: llama-chat.wasm [OPTIONS]
Options:
-m, --model-alias <ALIAS>
Model alias [default: default]
-c, --ctx-size <CTX_SIZE>
Size of the prompt context [default: 4096]
-n, --n-predict <N_PREDICT>
Number of tokens to predict [default: 1024]
-g, --n-gpu-layers <N_GPU_LAYERS>
Number of layers to run on the GPU [default: 100]
-b, --batch-size <BATCH_SIZE>
Batch size for prompt processing [default: 4096]
-r, --reverse-prompt <REVERSE_PROMPT>
Halt generation at PROMPT, return control.
-s, --system-prompt <SYSTEM_PROMPT>
System prompt message string [default: "[Default system message for the prompt template]"]
-p, --prompt-template <TEMPLATE>
Prompt template. [default: llama-2-chat] [possible values: llama-2-chat, codellama-instruct, mistral-instruct-v0.1, mistrallite, openchat, belle-llama-2-chat, vicuna-chat, chatml]
--log-prompts
Print prompt strings to stdout
--log-stat
Print statistics to stdout
--log-all
Print all log information to stdout
--stream-stdout
Print the output to stdout in the streaming way
-h, --help
Print help
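As an illustration of these options, here is a hypothetical invocation (the model path reuses the one from this thread; the system prompt string is just an example):

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
  llama-chat.wasm -c 2048 -n 512 -g 15 -p llama-2-chat \
  -s "You are a helpful assistant." --stream-stdout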
@eramax you could try a quantized TinyLlama 1B if you wish to run on the 1050 Ti; the model is much smaller and should be able to handle a few turns of conversation. But, just as @juntao mentioned, you will still be limited in how many questions you can ask.
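For reference, producing such a quantized GGUF can be done with llama.cpp's conversion and quantization tools; a rough sketch, assuming a TinyLlama 1.1B checkpoint downloaded locally (file and directory names are illustrative):

# Convert the HF checkpoint to an F16 GGUF, then quantize it to Q5_K_M
python3 convert.py ./tinyllama-1.1b --outtype f16 --outfile tinyllama-1.1b-f16.gguf
./quantize tinyllama-1.1b-f16.gguf tinyllama-1.1b-q5_k_m.gguf Q5_K_M

At roughly 5.7 bits per weight, a Q5_K_M 1.1B model comes out well under 1GB, leaving most of the 4GB of VRAM free for the KV cache.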
Hi @eramax
Due to a workaround for the current WASI-NN spec (the interface we use for this plugin), we load the model twice: the first load happens without setting ngl, and the second reloads it once ngl is set. Loading the model twice takes time.
We know this is expected behavior for now, but it's annoying, so we are creating an extension to the current WASI-NN spec to prevent it. This feature will ship in our next release.
Maximizing use of the hardware (e.g., CPU+GPU co-work) is also important to us; we will keep working on making this happen.
Hi @eramax We've updated the plugin. You can retrieve the latest version by re-running the installer. The new one should solve the reloading issue. However, we are still working on the CPU+GPU co-work. Thanks.
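For reference, re-installing with the GGML plugin at the time looked roughly like this (the exact installer flag is an assumption based on the project's docs and may have changed since):

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml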
Since there are no more updates, closing this.