Hi @katopz, this issue and #64 may be the same one.
As you can see at this line, we define a default system prompt for the llama2 model. However, it may not work with other models that use a different prompt schema.
Instead of using this example, please either change the default prompt or try our other example, llama-utils, which provides default prompts for several different models.
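For reference, here is a minimal sketch (not the llama-utils implementation) of how the llama-2-chat template that this example hard-codes differs from the ChatML template used by openhermes-2.5-mistral-7b. The helper names are hypothetical; only the prompt tags follow the formats documented for those models.

```rust
// Minimal sketch of the two prompt schemas; not llama-utils code.
fn llama2_chat_prompt(system: &str, user: &str) -> String {
    // llama-2-chat wraps the system prompt in <<SYS>> tags inside [INST].
    format!("<s>[INST] <<SYS>>\n{}\n<</SYS>>\n\n{} [/INST]", system, user)
}

fn chatml_prompt(system: &str, user: &str) -> String {
    // ChatML delimits each role with <|im_start|> / <|im_end|> tokens instead.
    format!(
        "<|im_start|>system\n{}<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
        system, user
    )
}

fn main() {
    println!("{}", llama2_chat_prompt("You are a helpful assistant.", "Hi"));
    println!("{}", chatml_prompt("You are a helpful assistant.", "Hi"));
}
```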
@katopz As @hydai mentioned, you can use llama-utils/chat to run the openhermes-2.5-mistral-7b model.
The CLI options of llama-chat.wasm:
ubuntu@ip-172-31-31-132:~/workspace/llama-utils/chat$ wasmedge llama-chat.wasm -h
Usage: llama-chat.wasm [OPTIONS]
Options:
-m, --model-alias <ALIAS>
Model alias [default: default]
-c, --ctx-size <CTX_SIZE>
Size of the prompt context [default: 4096]
-n, --n-predict <N_PRDICT>
Number of tokens to predict [default: 1024]
-g, --n-gpu-layers <N_GPU_LAYERS>
Number of layers to run on the GPU [default: 100]
-b, --batch-size <BATCH_SIZE>
Batch size for prompt processing [default: 4096]
-r, --reverse-prompt <REVERSE_PROMPT>
Halt generation at PROMPT, return control.
-s, --system-prompt <SYSTEM_PROMPT>
System prompt message string [default: "[Default system message for the prompt template]"]
-p, --prompt-template <TEMPLATE>
Prompt template. [default: llama-2-chat] [possible values: llama-2-chat, codellama-instruct, mistral-instruct-v0.1, mistrallite, openchat, belle-llama-2-chat, vicuna-chat, chatml]
--log-prompts
Print prompt strings to stdout
--log-stat
Print statistics to stdout
--log-all
Print all log information to stdout
--stream-stdout
Print the output to stdout in the streaming way
-h, --help
Print help
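As a side note, the -r/--reverse-prompt option is essentially a stop string: generation halts as soon as the model emits it, which is why ChatML models need -r '<|im_end|>'. Below is a rough sketch of that idea with hypothetical helper names; it is not the llama-chat.wasm source.

```rust
// Hypothetical sketch of a reverse-prompt (stop string) check during
// streaming generation; llama-chat.wasm's actual implementation may differ.
fn generate_until_stop(
    mut next_token: impl FnMut() -> Option<String>,
    reverse_prompt: &str,
) -> String {
    let mut output = String::new();
    while let Some(token) = next_token() {
        output.push_str(&token);
        // Halt as soon as the accumulated output ends with the stop string.
        if output.ends_with(reverse_prompt) {
            output.truncate(output.len() - reverse_prompt.len());
            break;
        }
    }
    output
}

fn main() {
    let mut tokens = vec!["Paris", "<|im_end|>", "extra"].into_iter();
    let answer = generate_until_stop(|| tokens.next().map(String::from), "<|im_end|>");
    assert_eq!(answer, "Paris");
    println!("{answer}");
}
```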
The command to run openhermes-2.5-mistral-7b.Q5_K_M.gguf is shown below. Notice the -p and -r options used in the command:
ubuntu@ip-172-31-31-132:~/workspace/llama-utils/chat$ wasmedge --dir .:. --nn-preload default:GGML:AUTO:openhermes-2.5-mistral-7b.Q5_K_M.gguf llama-chat.wasm -p chatml -r '<|im_end|>'
[INFO] Model alias: default
[INFO] Prompt context size: 4096
[INFO] Number of tokens to predict: 1024
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 4096
[INFO] Reverse prompt: <|im_end|>
[INFO] Use default system prompt
[INFO] Prompt template: ChatML
[INFO] Stream stdout: false
[INFO] Log prompts: false
[INFO] Log statistics: false
[INFO] Log all information: false
----------------------------------------------------
[USER]:
what's the capital of France?
[ASSISTANT]:
Paris<|im_end|>
[USER]:
how many planets are in the solar system?
[ASSISTANT]:
8<|im_end|><
[USER]:
what are their names?
[ASSISTANT]:
Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune<|im_end|>
[USER]:
which one is the biggest?
[ASSISTANT]:
Jupiter<|im_end|><
[USER]:
If you'd like to try our dev branch, you'll see better results:
ubuntu@ip-172-31-31-132:~/workspace/llama-utils/chat$ wasmedge --dir .:. --nn-preload default:GGML:AUTO:openhermes-2.5-mistral-7b.Q5_K_M.gguf llama-chat.wasm -p chatml -r '<|im_end|>'
[INFO] Model alias: default
[INFO] Prompt context size: 4096
[INFO] Number of tokens to predict: 1024
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 4096
[INFO] Reverse prompt: <|im_end|>
[INFO] Use default system prompt
[INFO] Prompt template: ChatML
[INFO] Stream stdout: false
[INFO] Log prompts: false
[INFO] Log statistics: false
[INFO] Log all information: false
----------------------------------------------------
[USER]:
what's the capital of France?
[ASSISTANT]:
Paris
[USER]:
what's the distance between Beijing and Tokyo?
[ASSISTANT]:
Approximately 1,800 kilometers (1,118 miles)
[USER]:
how many planets in the solar system?
[ASSISTANT]:
Eight planets in the solar system: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.
[USER]:
which one is the biggest?
[ASSISTANT]:
Jupiter is the largest planet in the solar system.
[USER]:
Cool, thanks, it's working! However, I can see the memory usage creep up slightly with each inference input on the main branch (e.g., saying "hi" 4 times).
Not sure if this is normal?
@katopz Could you please provide the environment info, including CPU, GPU (if possible), and memory? You're using the main branch of llama-chat for the memory test, right?
Yes, and it's pretty much the same on dev: 8461MiB, 8473MiB, 8491MiB, 8513MiB.
nvidia-smi output:
Sun Nov 12 22:05:25 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 546.01 CUDA Version: 12.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | Off |
| 0% 40C P5 44W / 450W | 8513MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 512 G /Xwayland N/A |
| 0 N/A N/A 62004 C /wasmedge N/A |
+-----------------------------------------------------------------------------+
GPU: NVIDIA GeForce RTX 4090 24GB
CPU: 13th Gen Intel(R) Core(TM) i7-13700K, 3400 MHz, 16 Core(s), 24 Logical Processor(s)
RAM: DDR5 32GB BUS 5200, KINGSTON FURY BEAST BLACK
SSD: 1TB SAMSUNG 980 PRO (R 7,000MB/s, W 5,000MB/s) M.2 NVMe
OS: Windows 11, WSL2 Ubuntu 22.04
My guess is that the slight memory increase is normal since the LLM is stateless. With every new turn in the conversation, it needs to process more history. So, it should consume more memory and respond slower with each new question in the conversation -- until the context window size is reached and it starts to "forget" earlier conversations.
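To illustrate why the footprint grows: each turn, the whole conversation so far is re-encoded into the prompt, so the prompt gets longer with every question until the context window is reached. Here is a toy sketch of that accumulation using ChatML-style tags; the types and helpers are hypothetical and not llama-chat.wasm code.

```rust
// Toy sketch of how the prompt grows each turn when the full history is
// re-sent to a stateless LLM (ChatML-style tags; hypothetical helper).
struct Chat {
    history: String,
}

impl Chat {
    fn new(system: &str) -> Self {
        Chat {
            history: format!("<|im_start|>system\n{}<|im_end|>\n", system),
        }
    }

    // Each turn appends the user message and the model's reply, so the next
    // prompt is strictly longer than the previous one.
    fn turn(&mut self, user: &str, assistant: &str) -> usize {
        self.history.push_str(&format!(
            "<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n{}<|im_end|>\n",
            user, assistant
        ));
        self.history.len() // proxy for how much context must be processed
    }
}

fn main() {
    let mut chat = Chat::new("You are a helpful assistant.");
    for i in 1..=4 {
        let len = chat.turn("hi", "Hello! How can I help you?");
        println!("turn {i}: prompt length = {len} chars");
    }
}
```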
Cool. In that case some tests will be needed, because this is crucial for production use and for setting a memory-usage baseline.
Anyway, I will close this issue for now because it's already off topic. Thanks!
With OpenHermes-2.5-Mistral-7B-GPTQ, I always get this answer for the first question:
To run this code, you'll need to have Rust installed on your system. You can install it from the official Rust website: https://www.rust-lang.org/tools/install
llama_print_timings:        load time =     654.03 ms
llama_print_timings:      sample time =       3.43 ms /   121 runs   (    0.03 ms per token, 35307.85 tokens per second)
llama_print_timings: prompt eval time =     180.11 ms /   119 tokens (    1.51 ms per token,   660.70 tokens per second)
llama_print_timings:        eval time =    1108.10 ms /   119 runs   (    9.31 ms per token,   107.39 tokens per second)
llama_print_timings:       total time =   11571.70 ms
Question: