Describe the bug
I am using the --cpu flag to run GGML models (specifically WizardLM-30B-Uncensored-GGML) entirely in RAM, as I don't have much VRAM.
In chat and chat-instruct mode, generation is much slower than in instruct mode, and the more prior history the character has, the slower it gets. Instruct mode works as expected.
I can't rule out a problem with my local setup, because everything worked normally at some point yesterday and then just stopped. However, I have reinstalled the UI and the model, cleared the local Python cache, and even reinstalled Python entirely, several times, and the issue persists.
This may be the expected behaviour, but I am not sure.
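For context, the launch command is roughly the following (other flags omitted; the exact invocation may vary with the install):
python server.py --cpu --model WizardLM-30B-Uncensored.ggmlv3.q4_0.bin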
Is there an existing issue for this?
[X] I have searched the existing issues
Reproduction
1. Enable chat or chat-instruct mode
2. Select any character
3. Try to generate some responses (using the 'Generate' button)
4. Wait for ~5 mins
5. Get a response
Screenshot
No response
Logs
/* Loading a model */
INFO:Loading WizardLM-30B-Uncensored.ggmlv3.q4_0.bin...
INFO:llama.cpp weights detected: models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin
INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.13 MB
llama_model_load_internal: mem required = 19756.67 MB (+ 3124.00 MB per state)
.
llama_init_from_file: kv self size = 3120.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
INFO:Loaded the model in 0.29 seconds.
/* Instruct mode */
llama_print_timings: load time = 1732.59 ms
llama_print_timings: sample time = 3.92 ms / 11 runs ( 0.36 ms per token)
llama_print_timings: prompt eval time = 1732.55 ms / 9 tokens ( 192.51 ms per token)
llama_print_timings: eval time = 5947.06 ms / 10 runs ( 594.71 ms per token)
llama_print_timings: total time = 7699.96 ms
Output generated in 7.92 seconds (1.26 tokens/s, 10 tokens, context 10, seed 1180324049)
/* Chat-instruct mode, Chiharu Yamada */
llama_print_timings: load time = 1732.59 ms
llama_print_timings: sample time = 6.28 ms / 17 runs ( 0.37 ms per token)
llama_print_timings: prompt eval time = 65955.69 ms / 409 tokens ( 161.26 ms per token)
llama_print_timings: eval time = 9782.56 ms / 16 runs ( 611.41 ms per token)
llama_print_timings: total time = 75769.95 ms
Output generated in 75.99 seconds (0.21 tokens/s, 16 tokens, context 410, seed 1596456226)
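Note how prompt evaluation dominates the chat-instruct run: 409 tokens at ~161 ms/token is ~66 s of the ~76 s total, while generating the 16 new tokens takes only ~10 s. In the instruct run above, the prompt is just 9 tokens, so prompt evaluation adds under 2 s.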
/* Chat mode, Chiharu Yamada (2.6kB logs) */
Llama.generate: prefix-match hit
Output generated in 141.75 seconds (0.82 tokens/s, 116 tokens, context 426, seed 849433334)
/* Chat mode, Chiharu Yamada (regenerate) */
Llama.generate: prefix-match hit
Output generated in 76.48 seconds (1.53 tokens/s, 117 tokens, context 426, seed 486988653)
/* Chat mode, custom character (12.2kB logs) */
Output generated in 343.99 seconds (0.07 tokens/s, 23 tokens, context 1837, seed 1835124699)
/* Chat mode, custom character (regenerate) */
Llama.generate: prefix-match hit
Output generated in 22.20 seconds (1.31 tokens/s, 29 tokens, context 1837, seed 564248536)
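The "Llama.generate: prefix-match hit" lines indicate llama.cpp reused the already-evaluated prompt prefix on regeneration, which would explain why the custom-character run drops from 343.99 s to 22.20 s when regenerated with unchanged context. To check whether the slowdown is prompt evaluation (re-reading the whole chat history every turn) rather than token generation itself, here is a minimal timing sketch against llama-cpp-python directly, the backend the UI uses for GGML models; this is my own illustration, not the UI's code:

import time
from llama_cpp import Llama

# Load the same quantized model, CPU only.
llm = Llama(model_path="models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin", n_ctx=2048)

# Compare a short prompt (instruct-like) with a long one (chat-history-like).
for prompt in ["Hello.", "Hello. " * 250]:
    t0 = time.time()
    llm(prompt, max_tokens=16)
    print(f"prompt of {len(prompt)} chars took {time.time() - t0:.1f} s")

If the long prompt takes proportionally longer while per-token generation speed stays the same, the behaviour matches the timings above.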
System Info