nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

Caching doesn't work #243

nameiwillforget opened 4 months ago

nameiwillforget commented 4 months ago

Caching doesn't work on either my laptop or my desktop Arch system:

[alex@Arch ~]$ sh wizardcoder-python-34b-v1.0.Q5_K_M.llamafile --prompt-cache ~/wizard/newfile -f wizard/test.el.prompt 
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
Log start
main: llamafile version 0.6.2
main: seed  = 1708969571
llama_model_loader: loaded meta data with 20 key-value pairs and 435 tensors from wizardcoder-python-34b-v1.0.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = wizardlm_wizardcoder-python-34b-v1.0
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 22016
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32001]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32001]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32001]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 260/32001 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32001
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 22016
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 33.74 B
llm_load_print_meta: model size       = 22.20 GiB (5.65 BPW) 
llm_load_print_meta: general.name     = wizardlm_wizardcoder-python-34b-v1.0
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/49 layers to GPU
llm_load_tensors:        CPU buffer size = 22733.75 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  3072.00 MiB
llama_new_context_with_model: KV self size  = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    48.07 MiB
llama_new_context_with_model:        CPU compute buffer size =  2305.60 MiB
llama_new_context_with_model: graph splits (measure): 1

system_info: n_threads = 12 / 24 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from '/home/alex/wizard/newfile'
main: session file does not exist, will create
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 16384, n_batch = 512, n_predict = -1, n_keep = 0

 [INST]You are an Emacs code generator. Writing comments is forbidden. Writing test code is forbidden. Writing English explanations is forbidden. Generate el code to complete:[/INST]
```el
(defconst all-greek-capital-letters )libc++abi: terminating due to uncaught exception of type std::runtime_error: failed to open /home/alex/wizard/newfile: Bad file number

error: Uncaught SIGABRT (SI_TKILL) at 0x3e80004ede4 on Arch pid 323044 tid 323044
  /home/alex/.local/bin/wizardcoder-python-34b-v1.0.Q5_K_M.llamafile
  Bad file number
  Linux Cosmopolitan 3.2.4 MODE=x86_64; #1 SMP PREEMPT_DYNAMIC Sat, 23 Sep 2023 22:55:13 +0000 Arch 6.5.5-arch1-1

RAX 0000000000000000 RBX 000010008004a750 RDI 000000000004ede4
RCX 00000000006b3096 RDX 0000000000000000 RSI 0000000000000006
RBP 00007ffdcb0ebde0 RSP 00007ffdcb0ebde0 RIP 00000000006b3096
 R8 0000000000000000  R9 0000000000000002 R10 00000000006b3096
R11 0000000000000296 R12 0000000000000006 R13 0000000000675cd0
R14 00000000006fe348 R15 00007ffdcb0ee140
TLS 0000000000746340

XMM0  00000000000000000000000000000000 XMM8  00000000000000000000000000000000
XMM1  00000000000000000000000000000000 XMM9  00000000000000000000000000000000
XMM2  0000000000000000000000000082b448 XMM10 00000000000000000000000000000000
XMM3  2f206e65706f206f742064656c696166 XMM11 00000000000000000000000000000000
XMM4  0000a5e1ffffa1000000894e00003f16 XMM12 00000000000000000000000000000000
XMM5  000074cb0000025700000c03000006d9 XMM13 00000000000000000000000000000000
XMM6  000000008f83f4cc000010020095a540 XMM14 00000000000000000000000000000000
XMM7  00000000000002000000000000000004 XMM15 00000000000000000000000000000000

cosmoaddr2line /home/alex/.local/bin/wizardcoder-python-34b-v1.0.Q5_K_M.llamafile 6b3096 6a8331 4179a8 6bc040 6bc1c2 688997 6883c9 5ad927 406108 413733 401604

note: won't print addr2line backtrace because pledge
7ffdcb0e8c30 6b3096 systemfive_linux+31
7ffdcb0ebde0 6a8331 raise+113
7ffdcb0ebe00 4179a8 abort+45
7ffdcb0ebe20 6bc040 NULL+0
7ffdcb0ebf10 6bc1c2 _ZL28demangling_terminate_handlerv+338
7ffdcb0ebfc0 688997 _ZSt11__terminatePFvvE+71
7ffdcb0ec040 6883c9 NULL+0
7ffdcb0ec070 5ad927 llama_save_session_file+5015
7ffdcb0ec360 406108 main+18440
7ffdcb0ee020 413733 cosmo+77
7ffdcb0ee030 401604 _start+133

10008004-10008011 rw-pa-      14x automap 896kB w/ 896kB hole
10008020-10008068 rw-pa-      73x automap 4672kB w/ 1472kB hole
10008080-100080b3 rw-pa-      52x automap 3328kB w/ 768kB hole
100080c0-100080d7 rw-pa-      24x automap 1536kB w/ 512kB hole
100080e0-10008100 rw-pa-      33x automap 2112kB w/ 960kB hole
10008110-10008127 rw-pa-      24x automap 1536kB w/ 14mB hole
10008200-100085e8 rw-pa-   1'001x automap 63mB w/ 1472kB hole
10008600-10008901 rw-pa-     770x automap 48mB w/ 1904mB hole
10010000-1001c000 rw-pa-  49'153x automap 3072mB w/ 1024mB hole
10020000-10029019 rw-pa-  36'890x automap 2306mB w/ 5892mB hole
10040060-10098eec r--s-- 364'173x automap 22gB w/ 10gB hole
100c0000-10118ce7 r--s-- 363'752x automap 22gB w/ 96tB hole
6fd00004-6fd00004 rw-paF       1x zipos 64kB w/ 64gB hole
6fe00004-6fe00004 rw-paF       1x g_fds 64kB
# 50gB total mapped memory
/home/alex/.local/bin/wizardcoder-python-34b-v1.0.Q5_K_M.llamafile -m wizardcoder-python-34b-v1.0.Q5_K_M.gguf -c 0 --prompt-cache /home/alex/wizard/newfile -f wizard/test.el.prompt 
Aborted (core dumped)
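
The abort happens inside llama_save_session_file with a "Bad file number" error, and the trace above notes that the pledge sandbox is active, so one debugging sketch (just my guess; the --unsecure option is my assumption about llamafile's flag for disabling the pledge/SECCOMP sandbox, and the writability check only rules out an obvious permissions problem) would be:

[alex@Arch ~]$ ls -ld ~/wizard && touch ~/wizard/writetest && rm ~/wizard/writetest   # confirm the target directory is writable
[alex@Arch ~]$ sh wizardcoder-python-34b-v1.0.Q5_K_M.llamafile --unsecure --prompt-cache ~/wizard/newfile -f wizard/test.el.prompt   # assumed flag: retry with sandboxing disabled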

From then on, if I try to use this now-existing cache file:

[alex@Arch ~]$ sh wizardcoder-python-34b-v1.0.Q5_K_M.llamafile --prompt-cache ~/wizard/newfile -f wizard/test.el.prompt 
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
Log start
main: llamafile version 0.6.2
main: seed  = 1708969624
llama_model_loader: loaded meta data with 20 key-value pairs and 435 tensors from wizardcoder-python-34b-v1.0.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = wizardlm_wizardcoder-python-34b-v1.0
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 22016
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32001]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32001]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32001]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 260/32001 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32001
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 22016
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 33.74 B
llm_load_print_meta: model size       = 22.20 GiB (5.65 BPW) 
llm_load_print_meta: general.name     = wizardlm_wizardcoder-python-34b-v1.0
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.17 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/49 layers to GPU
llm_load_tensors:        CPU buffer size = 22733.75 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  3072.00 MiB
llama_new_context_with_model: KV self size  = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    48.07 MiB
llama_new_context_with_model:        CPU compute buffer size =  2305.60 MiB
llama_new_context_with_model: graph splits (measure): 1

system_info: n_threads = 12 / 24 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from '/home/alex/wizard/newfile'
error loading session file: failed to open /home/alex/wizard/newfile: I/O error
main: error: failed to load session file '/home/alex/wizard/newfile'
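
Since the first run aborted partway through llama_save_session_file, my assumption is that ~/wizard/newfile was left truncated, which would explain the I/O error on load. A cleanup sketch (a workaround guess, not a fix) is to delete the partial cache file and retry from a clean state to see whether the original crash reproduces:

[alex@Arch ~]$ rm ~/wizard/newfile   # discard the presumably truncated session file
[alex@Arch ~]$ sh wizardcoder-python-34b-v1.0.Q5_K_M.llamafile --prompt-cache ~/wizard/newfile -f wizard/test.el.prompt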

The same prompt works when run without the --prompt-cache flag.
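
For reference, this is the invocation that does work (same model and prompt file, just without --prompt-cache):

[alex@Arch ~]$ sh wizardcoder-python-34b-v1.0.Q5_K_M.llamafile -f wizard/test.el.prompt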