h2oai / h2ogpt

Private chat with a local GPT over documents, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/
http://h2o.ai
Apache License 2.0

ggml_allocr_alloc: not enough space in the buffer #932

Closed: slavag closed this issue 9 months ago

slavag commented 9 months ago

Hi, after updating the h2ogpt code to the latest commit (6a7283eb66096d188f796760f58680c1d9c16dbc) I started getting ggml_allocr_alloc: not enough space in the buffer on an NVIDIA A10 with 24 GB of VRAM. It was working fine before: I had updated h2ogpt just yesterday and everything ran correctly, with only about 67% of VRAM utilized after startup.

Startup log:

python generate.py --base_model=llama --prompt_type=llama2 --model_path_llama=/mnt/AI/models/TheBloke/Llama-2-13B-chat-GGUF/llama-2-13b-chat.Q6_K.gguf  --allow_upload_to_user_data=False --langchain_modes="[ZenDeskTicketsWithDocs, MyData]" --langchain_mode=ZenDeskTicketsWithDocs --langchain_mode_types="{'ZenDeskTicketsWithDocs':'shared'}" --visible_side_bar=False --visible_doc_selection_tab=False --h2ocolors=False --use_llm_if_no_docs=False --max_seq_len=4096  --top_k_docs=-1 --temperature=0 --top_p=0.01 --top_k=2
Using Model llama
load INSTRUCTOR_Transformer
max_seq_length  512
Starting get_model: llama 
/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A10G, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /mnt/AI/models/TheBloke/Llama-2-13B-chat-GGUF/llama-2-13b-chat.Q6_K.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q6_K     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    8:              blk.0.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    9:              blk.0.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   10:           blk.1.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   11:            blk.1.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   12:            blk.1.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   13:              blk.1.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   16:         blk.1.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   17:              blk.1.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   18:              blk.1.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   19:          blk.10.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   20:           blk.10.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   21:           blk.10.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   22:             blk.10.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   23:           blk.10.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   24:             blk.10.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   25:        blk.10.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   26:             blk.10.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   27:             blk.10.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   28:          blk.11.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   29:           blk.11.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   30:           blk.11.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   31:             blk.11.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   32:           blk.11.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   33:             blk.11.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   34:        blk.11.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   35:             blk.11.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   36:             blk.11.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   37:          blk.12.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   38:           blk.12.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   39:           blk.12.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   40:             blk.12.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   41:           blk.12.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   42:             blk.12.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   43:        blk.12.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   44:             blk.12.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   45:             blk.12.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   46:          blk.13.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   47:           blk.13.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   48:           blk.13.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   49:             blk.13.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   50:           blk.13.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   51:             blk.13.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   52:        blk.13.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   53:             blk.13.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   54:             blk.13.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   55:          blk.14.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   56:           blk.14.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   57:           blk.14.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   58:             blk.14.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   59:           blk.14.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   60:             blk.14.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   61:        blk.14.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   62:             blk.14.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   63:             blk.14.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   64:             blk.15.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   65:             blk.15.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   66:           blk.2.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   67:            blk.2.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   68:            blk.2.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   69:              blk.2.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   70:            blk.2.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   71:              blk.2.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   72:         blk.2.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   73:              blk.2.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   74:              blk.2.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   75:           blk.3.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   76:            blk.3.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   77:            blk.3.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   78:              blk.3.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   79:            blk.3.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   80:              blk.3.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   81:         blk.3.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   82:              blk.3.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   83:              blk.3.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   84:           blk.4.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   85:            blk.4.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   86:            blk.4.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   87:              blk.4.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   88:            blk.4.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   89:              blk.4.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   90:         blk.4.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   91:              blk.4.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   92:              blk.4.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   93:           blk.5.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   94:            blk.5.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   95:            blk.5.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   96:              blk.5.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   97:            blk.5.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   98:              blk.5.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   99:         blk.5.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  100:              blk.5.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  101:              blk.5.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  102:           blk.6.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  103:            blk.6.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  104:            blk.6.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  105:              blk.6.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  106:            blk.6.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  107:              blk.6.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  108:         blk.6.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  109:              blk.6.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  110:              blk.6.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  111:           blk.7.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  112:            blk.7.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  113:            blk.7.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  114:              blk.7.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  115:            blk.7.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  116:              blk.7.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  117:         blk.7.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  118:              blk.7.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  119:              blk.7.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  120:           blk.8.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  121:            blk.8.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  122:            blk.8.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  123:              blk.8.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  124:            blk.8.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  125:              blk.8.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  126:         blk.8.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  127:              blk.8.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  128:              blk.8.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  129:           blk.9.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  130:            blk.9.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  131:            blk.9.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  132:              blk.9.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  133:            blk.9.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  134:              blk.9.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  135:         blk.9.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  136:              blk.9.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  137:              blk.9.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  138:          blk.15.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  139:           blk.15.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  140:           blk.15.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  141:             blk.15.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  142:           blk.15.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  143:        blk.15.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  144:             blk.15.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  145:          blk.16.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  146:           blk.16.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  147:           blk.16.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  148:             blk.16.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  149:           blk.16.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  150:             blk.16.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  151:        blk.16.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  152:             blk.16.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  153:             blk.16.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  154:          blk.17.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  155:           blk.17.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  156:           blk.17.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  157:             blk.17.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  158:           blk.17.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  159:             blk.17.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  160:        blk.17.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  161:             blk.17.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  162:             blk.17.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  163:          blk.18.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  164:           blk.18.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  165:           blk.18.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  166:             blk.18.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  167:           blk.18.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  168:             blk.18.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  169:        blk.18.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  170:             blk.18.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  171:             blk.18.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  172:          blk.19.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  173:           blk.19.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  174:           blk.19.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  175:             blk.19.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  176:           blk.19.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  177:             blk.19.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  178:        blk.19.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  179:             blk.19.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  180:             blk.19.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  181:          blk.20.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  182:           blk.20.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  183:           blk.20.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  184:             blk.20.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  185:           blk.20.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  186:             blk.20.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  187:        blk.20.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  188:             blk.20.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  189:             blk.20.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  190:          blk.21.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  191:           blk.21.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  192:           blk.21.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  193:             blk.21.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  194:           blk.21.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  195:             blk.21.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  196:        blk.21.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  197:             blk.21.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  198:             blk.21.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  199:          blk.22.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  200:           blk.22.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  201:           blk.22.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  202:             blk.22.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  203:           blk.22.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  204:             blk.22.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  205:        blk.22.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  206:             blk.22.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  207:             blk.22.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  208:          blk.23.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  209:           blk.23.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  210:           blk.23.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  211:             blk.23.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  212:           blk.23.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  213:             blk.23.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  214:        blk.23.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  215:             blk.23.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  216:             blk.23.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  217:          blk.24.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  218:           blk.24.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  219:           blk.24.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  220:             blk.24.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  221:           blk.24.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  222:             blk.24.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  223:        blk.24.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  224:             blk.24.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  225:             blk.24.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  226:          blk.25.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  227:           blk.25.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  228:           blk.25.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  229:             blk.25.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  230:           blk.25.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  231:             blk.25.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  232:        blk.25.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  233:             blk.25.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  234:             blk.25.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  235:          blk.26.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  236:           blk.26.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  237:           blk.26.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  238:             blk.26.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  239:           blk.26.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  240:             blk.26.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  241:        blk.26.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  242:             blk.26.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  243:             blk.26.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  244:          blk.27.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  245:           blk.27.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  246:           blk.27.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  247:             blk.27.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  248:           blk.27.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  249:             blk.27.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  250:        blk.27.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  251:             blk.27.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  252:             blk.27.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  253:          blk.28.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  254:           blk.28.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  255:           blk.28.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  256:             blk.28.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  257:           blk.28.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  258:             blk.28.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  259:        blk.28.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  260:             blk.28.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  261:             blk.28.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  262:          blk.29.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  263:           blk.29.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  264:           blk.29.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  265:             blk.29.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  266:           blk.29.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  267:             blk.29.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  268:        blk.29.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  269:             blk.29.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  270:             blk.29.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  271:           blk.30.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  272:             blk.30.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  273:             blk.30.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  274:        blk.30.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  275:             blk.30.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  276:             blk.30.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  277:                    output.weight q6_K     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor  278:          blk.30.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  279:           blk.30.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  280:           blk.30.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  281:          blk.31.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  282:           blk.31.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  283:           blk.31.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  284:             blk.31.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  285:           blk.31.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  286:             blk.31.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  287:        blk.31.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  288:             blk.31.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  289:             blk.31.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  290:          blk.32.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  291:           blk.32.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  292:           blk.32.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  293:             blk.32.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  294:           blk.32.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  295:             blk.32.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  296:        blk.32.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  297:             blk.32.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  298:             blk.32.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  299:          blk.33.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  300:           blk.33.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  301:           blk.33.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  302:             blk.33.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  303:           blk.33.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  304:             blk.33.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  305:        blk.33.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  306:             blk.33.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  307:             blk.33.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  308:          blk.34.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  309:           blk.34.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  310:           blk.34.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  311:             blk.34.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  312:           blk.34.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  313:             blk.34.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  314:        blk.34.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  315:             blk.34.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  316:             blk.34.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  317:          blk.35.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  318:           blk.35.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  319:           blk.35.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  320:             blk.35.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  321:           blk.35.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  322:             blk.35.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  323:        blk.35.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  324:             blk.35.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  325:             blk.35.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  326:          blk.36.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  327:           blk.36.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  328:           blk.36.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  329:             blk.36.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  330:           blk.36.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  331:             blk.36.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  332:        blk.36.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  333:             blk.36.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  334:             blk.36.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  335:          blk.37.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  336:           blk.37.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  337:           blk.37.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  338:             blk.37.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  339:           blk.37.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  340:             blk.37.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  341:        blk.37.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  342:             blk.37.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  343:             blk.37.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  344:          blk.38.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  345:           blk.38.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  346:           blk.38.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  347:             blk.38.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  348:           blk.38.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  349:             blk.38.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  350:        blk.38.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  351:             blk.38.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  352:             blk.38.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  353:          blk.39.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  354:           blk.39.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  355:           blk.39.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  356:             blk.39.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  357:           blk.39.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  358:             blk.39.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  359:        blk.39.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  360:             blk.39.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  361:             blk.39.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  362:               output_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - kv  18:               general.quantization_version u32     
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q6_K:  282 tensors
llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 4096
llm_load_print_meta: n_embd         = 5120
llm_load_print_meta: n_head         = 40
llm_load_print_meta: n_head_kv      = 40
llm_load_print_meta: n_layer        = 40
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 13824
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 13B
llm_load_print_meta: model ftype    = mostly Q6_K
llm_load_print_meta: model size     = 13.02 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.12 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  128.29 MB (+ 3200.00 MB per state)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 13256 MB
....................................................................................................
llama_new_context_with_model: kv self size  = 3200.00 MB
llama_new_context_with_model: compute buffer total size =  351.47 MB
llama_new_context_with_model: VRAM scratch buffer: 350.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'llama2', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': '<s>[INST] ', 'PreInput': None, 'PreResponse': '[/INST]', 'terminate_response': ['[INST]', '</s>'], 'chat_sep': ' ', 'chat_turn_sep': ' </s>', 'humanstr': '[INST]', 'botstr': '[/INST]', 'generates_leading_space': False, 'system_prompt': ''}, 'visible_models': None, 'h2ogpt_key': None, 'load_8bit': False, 'load_4bit': False, 'low_bit_mode': 1, 'load_half': True, 'load_gptq': '', 'load_exllama': False, 'use_safetensors': False, 'revision': None, 'use_gpu_id': True, 'gpu_id': 0, 'compile_model': True, 'use_cache': None, 'llamacpp_dict': {'n_gpu_layers': 100, 'use_mlock': True, 'n_batch': 1024, 'n_gqa': 0, 'model_path_llama': '/mnt/AI/models/TheBloke/Llama-2-13B-chat-GGUF/llama-2-13b-chat.Q6_K.gguf', 'model_name_gptj': 'ggml-gpt4all-j-v1.3-groovy.bin', 'model_name_gpt4all_llama': 'ggml-wizardLM-7B.q4_2.bin', 'model_name_exllama_if_no_config': 'TheBloke/Nous-Hermes-Llama2-GPTQ'}, 'model_path_llama': '/mnt/AI/models/TheBloke/Llama-2-13B-chat-GGUF/llama-2-13b-chat.Q6_K.gguf', 'model_name_gptj': 'ggml-gpt4all-j-v1.3-groovy.bin', 'model_name_gpt4all_llama': 'ggml-wizardLM-7B.q4_2.bin', 'model_name_exllama_if_no_config': 'TheBloke/Nous-Hermes-Llama2-GPTQ'}
load INSTRUCTOR_Transformer
max_seq_length  512
/mnt/AI/h2ogpt/src/gradio_runner.py:732: GradioUnusedKwargWarning: You have unused kwarg parameters in Dropdown, please remove them: {'filterable': False}
  langchain_agents = gr.Dropdown(
/mnt/AI/h2ogpt/src/gradio_runner.py:836: GradioUnusedKwargWarning: You have unused kwarg parameters in Dropdown, please remove them: {'filterable': False}
  visible_models = gr.Dropdown(kwargs['all_models'],
Running on local URL:  http://0.0.0.0:7860
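
As a rough back-of-the-envelope check (not from the log itself, just arithmetic over the numbers it reports, assuming the A10G's 24 GB is treated as 24 * 1024 MB), the offloaded weights, KV cache and scratch buffer should fit comfortably on the card, which matches the ~67% utilization seen on the previously working build:

# Rough VRAM accounting from the startup log above (all values in MB).
model_weights_mb = 13256     # "llm_load_tensors: VRAM used: 13256 MB"
kv_cache_mb = 3200.0         # "llama_new_context_with_model: kv self size  = 3200.00 MB"
scratch_mb = 350.0           # "llama_new_context_with_model: VRAM scratch buffer: 350.00 MB"

total_mb = model_weights_mb + kv_cache_mb + scratch_mb
print(f"approx. VRAM needed: {total_mb:.0f} MB (~{total_mb / (24 * 1024) * 100:.0f}% of 24 GB)")
# -> approx. VRAM needed: 16806 MB (~68% of 24 GB)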

Error log:

ggml_allocr_alloc: not enough space in the buffer (needed 335544320, largest block available 304085024)
GGML_ASSERT: /tmp/pip-install-lq9gmmv7/llama-cpp-python_511fca8a040049ac9ab4298ec532185f/vendor/llama.cpp/ggml-alloc.c:173: !"not enough space in the buffer"
Fatal Python error: Aborted

Current thread 0x00007f781d6ff700 (most recent call first):
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/llama_cpp/llama_cpp.py", line 808 in llama_eval
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/llama_cpp/llama.py", line 500 in eval
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/llama_cpp/llama.py", line 777 in generate
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/llama_cpp/llama.py", line 957 in _create_completion
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/llms/llamacpp.py", line 288 in _stream
  File "/mnt/AI/h2ogpt/src/gpt4all_llm.py", line 399 in _stream
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/llms/base.py", line 341 in stream
  File "/mnt/AI/h2ogpt/src/gpt4all_llm.py", line 365 in _call
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/llms/base.py", line 961 in _generate
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/llms/base.py", line 475 in _generate_helper
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/llms/base.py", line 582 in generate
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/llms/base.py", line 451 in generate_prompt
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/chains/llm.py", line 102 in generate
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/chains/llm.py", line 92 in _call
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/chains/base.py", line 252 in __call__
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/chains/llm.py", line 252 in predict
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/chains/combine_documents/stuff.py", line 165 in combine_docs
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/chains/combine_documents/base.py", line 106 in _call
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/chains/base.py", line 252 in __call__
  File "/mnt/AI/h2ogpt/src/utils.py", line 412 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007f7874ca4700 (most recent call first):
  File "/mnt/miniconda3/envs/python311/lib/python3.11/concurrent/futures/thread.py", line 81 in _worker
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 975 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007f78757b5700 (most recent call first):
  File "/mnt/AI/h2ogpt/src/utils_langchain.py", line 67 in __next__
  File "/mnt/AI/h2ogpt/src/gpt_langchain.py", line 3814 in _run_qa_db
  File "/mnt/AI/h2ogpt/src/gen.py", line 2387 in evaluate
  File "/mnt/AI/h2ogpt/src/gradio_runner.py", line 3015 in get_response
  File "/mnt/AI/h2ogpt/src/gradio_runner.py", line 3066 in bot
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/gradio/utils.py", line 695 in gen_wrapper
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/gradio/utils.py", line 326 in run_sync_iterator_async
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007f788905a700 (most recent call first):
  File "/mnt/miniconda3/envs/python311/lib/python3.11/asyncio/runners.py", line 118 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/asyncio/runners.py", line 190 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/uvicorn/server.py", line 61 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 975 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007f7889a5b700 (most recent call first):
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 324 in wait
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 622 in wait
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/apscheduler/schedulers/blocking.py", line 30 in _main_loop
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 975 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007f7899fff700 (most recent call first):
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 324 in wait
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 622 in wait
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007f78a2a84700 (most recent call first):
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 324 in wait
  File "/mnt/miniconda3/envs/python311/lib/python3.11/queue.py", line 180 in get
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/posthog/consumer.py", line 104 in next
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/posthog/consumer.py", line 73 in upload
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/posthog/consumer.py", line 62 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007f79b0fd1740 (most recent call first):
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/gradio/blocks.py", line 2202 in block_thread
  File "/mnt/AI/h2ogpt/src/gradio_runner.py", line 4183 in go_gradio
  File "/mnt/AI/h2ogpt/src/gen.py", line 1275 in main
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/fire/core.py", line 691 in _CallAndUpdateTrace
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/fire/core.py", line 475 in _Fire
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/fire/core.py", line 141 in Fire
  File "/mnt/AI/h2ogpt/src/utils.py", line 64 in H2O_Fire
  File "/mnt/AI/h2ogpt/generate.py", line 12 in entrypoint_main
  File "/mnt/AI/h2ogpt/generate.py", line 16 in <module>

Extension modules: simplejson._speedups, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslib, pandas._libs.ops, numexpr.interpreter, pyarrow._compute, pandas._libs.arrays, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, lz4._version, lz4.frame._frame, psutil._psutil_linux, psutil._psutil_posix, matplotlib._c_internal_utils, PIL._imaging, matplotlib._path, kiwisolver._cext, matplotlib._image, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, yaml._yaml, sentencepiece._sentencepiece, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, sqlalchemy.cyextension.collections, sqlalchemy.cyextension.immutabledict, sqlalchemy.cyextension.processors, sqlalchemy.cyextension.resultproxy, sqlalchemy.cyextension.util, greenlet._greenlet, scipy._lib._ccallback_c, numpy.linalg.lapack_lite, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, 
scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._statlib, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, regex._regex, sklearn.__check_build._check_build, sklearn.utils.murmurhash, sklearn.utils._isfinite, sklearn.utils._openmp_helpers, sklearn.utils._vector_sentinel, sklearn.feature_extraction._hashing_fast, sklearn.utils._logistic_sigmoid, sklearn.utils.sparsefuncs_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.utils._cython_blas, sklearn.svm._libsvm, sklearn.svm._liblinear, sklearn.svm._libsvm_sparse, sklearn.utils._random, sklearn.utils._seq_dataset, sklearn.utils.arrayfuncs, sklearn.utils._typedefs, sklearn.utils._readonly_array_wrapper, sklearn.metrics._dist_metrics, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_fast, sklearn.linear_model._cd_fast, sklearn._loss._loss, sklearn.utils._weight_vector, sklearn.linear_model._sgd_fast, sklearn.linear_model._sag_fast, sklearn.datasets._svmlight_format_fast, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, google._upb._message, zstandard.backend_c, websockets.speedups, ujson, markupsafe._speedups, PIL._webp, uvloop.loop, httptools.parser.parser, httptools.parser.url_parser (total: 262)
Aborted
slavag commented 9 months ago

Reverting to commit c38c19e507120eee8b06a2cfe4042c03f5610735 didn't solve the issue. Reverting to commit cb09dab183b36c1c49d21a3efc5ad7dc427278bc solved the issue.

pseudotensor commented 9 months ago

I suspect you are running out of memory after this change: https://github.com/h2oai/h2ogpt/commit/c38c19e507120eee8b06a2cfe4042c03f5610735

You can pass --max_seq_len=2048 if you want to go back to the old (wrong) behavior and use less memory.
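As a rough sanity check (a back-of-the-envelope sketch, not the actual llama.cpp allocator accounting), the KV cache alone grows linearly with max_seq_len; for this 13B model's dimensions from the log (n_layer=40, n_embd=5120, n_head_kv=n_head=40) and an assumed f16 cache:

# rough KV-cache size estimate for a llama.cpp model (illustrative sketch only)
n_layer, n_embd, n_head, n_head_kv = 40, 5120, 40, 40   # 13B values from the startup log
bytes_per_elem = 2                                       # assuming an f16 KV cache

def kv_cache_mb(n_ctx):
    kv_dim = n_embd * n_head_kv // n_head                # per-token width of K (or V); 5120 here
    return 2 * n_layer * n_ctx * kv_dim * bytes_per_elem / 1024**2   # K and V caches together

print(kv_cache_mb(4096))  # 3200.0 MB, matching "kv self size = 3200.00 MB" in the log
print(kv_cache_mb(2048))  # 1600.0 MB

The compute/scratch buffers come on top of that, so a smaller max_seq_len generally reduces total VRAM use further.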

slavag commented 9 months ago

@pseudotensor why does reverting the code to commit cb09dab183b36c1c49d21a3efc5ad7dc427278bc solve the issue?

pseudotensor commented 9 months ago

@slavag That can't be the cause, because that commit only touches OpenAI code and is well isolated to it.

slavag commented 9 months ago

@pseudotensor but it's a fact. I can take the recent code again, check it, and then revert to c38c19e507120eee8b06a2cfe4042c03f5610735, but I already did this: reverting back solves the issue.

pseudotensor commented 9 months ago

Sorry, I can't follow what you are saying. https://github.com/h2oai/h2ogpt/commit/c38c19e507120eee8b06a2cfe4042c03f5610735 can matter and is likely the issue, but https://github.com/h2oai/h2ogpt/commit/cb09dab183b36c1c49d21a3efc5ad7dc427278bc cannot matter. You must be making an error when checking things; you can see that commit only has to do with OpenAI code.

But again, it's not just about reverting some commit. The commit https://github.com/h2oai/h2ogpt/commit/c38c19e507120eee8b06a2cfe4042c03f5610735 is not wrong. One needs to reduce max_seq_len.

slavag commented 9 months ago

@pseudotensor sorry for confusing you. With HEAD is now at cb09dab1 Fixes #928, the command h2ogpt]$ python generate.py --base_model=llama --prompt_type=llama2 --model_path_llama=/mnt/AI/models/TheBloke/Llama-2-13B-chat-GGUF/llama-2-13b-chat.Q6_K.gguf --allow_upload_to_user_data=False --langchain_modes="[ZenDeskTicketsWithDocs, MyData]" --langchain_mode=ZenDeskTicketsWithDocs --langchain_mode_types="{'ZenDeskTicketsWithDocs':'shared'}" --visible_side_bar=False --visible_doc_selection_tab=False --h2ocolors=False --use_llm_if_no_docs=False --max_seq_len=4096 --top_k_docs=-1 --temperature=0 --top_p=0.01 --top_k=2 works. When I do HEAD is now at c38c19e5 Fixes #915, it fails with the same execution command.

slavag commented 9 months ago

@pseudotensor changed to 2048, same issue, with the latest code. Reverting back to cb09dab183b36c1c49d21a3efc5ad7dc427278bc - no issue. And, btw, there is 10GB of VRAM available after starting the chat.

pseudotensor commented 9 months ago

Probably a language issue. When you said "revert" before, that normally means to do:

git revert <hash>

I think instead you are doing:

git checkout <hash>

?
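For reference, these are quite different operations (generic git behavior, nothing h2ogpt-specific):

git revert <hash>        # creates a new commit that undoes the changes introduced by <hash>
git checkout <hash>      # detaches HEAD at <hash>; the working tree matches that old commit
git reset --hard <hash>  # moves the current branch back to <hash>, discarding later commits and local changes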

pseudotensor commented 9 months ago

FYI, if I run your command on main (just changing it to use a URL instead of a path, and with no score model), I see no issues:

(h2ogpt) jon@pseudotensor:~/h2ogpt$ CUDA_VISIBLE_DEVICES=0 python generate.py --base_model=llama --prompt_type=llama2 --model_path_llama=https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q6_K.gguf  --allow_upload_to_user_data=False --langchain_modes="[ZenDeskTicketsWithDocs, MyData]" --langchain_mode=ZenDeskTicketsWithDocs --langchain_mode_types="{'ZenDeskTicketsWithDocs':'shared'}" --visible_side_bar=False --visible_doc_selection_tab=False --h2ocolors=False --use_llm_if_no_docs=False --max_seq_len=4096  --top_k_docs=-1 --temperature=0 --top_p=0.01 --top_k=2 --score_model=None 
Using Model llama
Starting get_model: llama 
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from llama-2-13b-chat.Q6_K.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q6_K     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    8:              blk.0.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    9:              blk.0.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   10:           blk.1.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   11:            blk.1.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   12:            blk.1.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   13:              blk.1.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   16:         blk.1.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   17:              blk.1.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   18:              blk.1.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   19:          blk.10.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   20:           blk.10.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   21:           blk.10.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   22:             blk.10.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   23:           blk.10.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   24:             blk.10.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   25:        blk.10.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   26:             blk.10.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   27:             blk.10.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   28:          blk.11.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   29:           blk.11.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   30:           blk.11.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   31:             blk.11.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   32:           blk.11.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   33:             blk.11.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   34:        blk.11.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   35:             blk.11.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   36:             blk.11.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   37:          blk.12.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   38:           blk.12.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   39:           blk.12.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   40:             blk.12.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   41:           blk.12.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   42:             blk.12.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   43:        blk.12.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   44:             blk.12.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   45:             blk.12.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   46:          blk.13.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   47:           blk.13.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   48:           blk.13.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   49:             blk.13.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   50:           blk.13.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   51:             blk.13.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   52:        blk.13.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   53:             blk.13.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   54:             blk.13.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   55:          blk.14.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   56:           blk.14.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   57:           blk.14.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   58:             blk.14.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   59:           blk.14.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   60:             blk.14.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   61:        blk.14.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   62:             blk.14.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   63:             blk.14.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   64:             blk.15.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   65:             blk.15.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   66:           blk.2.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   67:            blk.2.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   68:            blk.2.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   69:              blk.2.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   70:            blk.2.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   71:              blk.2.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   72:         blk.2.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   73:              blk.2.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   74:              blk.2.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   75:           blk.3.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   76:            blk.3.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   77:            blk.3.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   78:              blk.3.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   79:            blk.3.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   80:              blk.3.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   81:         blk.3.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   82:              blk.3.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   83:              blk.3.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   84:           blk.4.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   85:            blk.4.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   86:            blk.4.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   87:              blk.4.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   88:            blk.4.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   89:              blk.4.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   90:         blk.4.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   91:              blk.4.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   92:              blk.4.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   93:           blk.5.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   94:            blk.5.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   95:            blk.5.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   96:              blk.5.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   97:            blk.5.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   98:              blk.5.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   99:         blk.5.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  100:              blk.5.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  101:              blk.5.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  102:           blk.6.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  103:            blk.6.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  104:            blk.6.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  105:              blk.6.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  106:            blk.6.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  107:              blk.6.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  108:         blk.6.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  109:              blk.6.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  110:              blk.6.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  111:           blk.7.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  112:            blk.7.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  113:            blk.7.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  114:              blk.7.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  115:            blk.7.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  116:              blk.7.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  117:         blk.7.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  118:              blk.7.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  119:              blk.7.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  120:           blk.8.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  121:            blk.8.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  122:            blk.8.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  123:              blk.8.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  124:            blk.8.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  125:              blk.8.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  126:         blk.8.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  127:              blk.8.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  128:              blk.8.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  129:           blk.9.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  130:            blk.9.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  131:            blk.9.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  132:              blk.9.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  133:            blk.9.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  134:              blk.9.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  135:         blk.9.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  136:              blk.9.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  137:              blk.9.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  138:          blk.15.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  139:           blk.15.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  140:           blk.15.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  141:             blk.15.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  142:           blk.15.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  143:        blk.15.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  144:             blk.15.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  145:          blk.16.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  146:           blk.16.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  147:           blk.16.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  148:             blk.16.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  149:           blk.16.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  150:             blk.16.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  151:        blk.16.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  152:             blk.16.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  153:             blk.16.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  154:          blk.17.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  155:           blk.17.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  156:           blk.17.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  157:             blk.17.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  158:           blk.17.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  159:             blk.17.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  160:        blk.17.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  161:             blk.17.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  162:             blk.17.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  163:          blk.18.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  164:           blk.18.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  165:           blk.18.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  166:             blk.18.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  167:           blk.18.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  168:             blk.18.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  169:        blk.18.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  170:             blk.18.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  171:             blk.18.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  172:          blk.19.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  173:           blk.19.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  174:           blk.19.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  175:             blk.19.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  176:           blk.19.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  177:             blk.19.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  178:        blk.19.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  179:             blk.19.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  180:             blk.19.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  181:          blk.20.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  182:           blk.20.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  183:           blk.20.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  184:             blk.20.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  185:           blk.20.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  186:             blk.20.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  187:        blk.20.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  188:             blk.20.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  189:             blk.20.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  190:          blk.21.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  191:           blk.21.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  192:           blk.21.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  193:             blk.21.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  194:           blk.21.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  195:             blk.21.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  196:        blk.21.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  197:             blk.21.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  198:             blk.21.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  199:          blk.22.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  200:           blk.22.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  201:           blk.22.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  202:             blk.22.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  203:           blk.22.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  204:             blk.22.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  205:        blk.22.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  206:             blk.22.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  207:             blk.22.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  208:          blk.23.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  209:           blk.23.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  210:           blk.23.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  211:             blk.23.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  212:           blk.23.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  213:             blk.23.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  214:        blk.23.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  215:             blk.23.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  216:             blk.23.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  217:          blk.24.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  218:           blk.24.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  219:           blk.24.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  220:             blk.24.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  221:           blk.24.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  222:             blk.24.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  223:        blk.24.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  224:             blk.24.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  225:             blk.24.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  226:          blk.25.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  227:           blk.25.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  228:           blk.25.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  229:             blk.25.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  230:           blk.25.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  231:             blk.25.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  232:        blk.25.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  233:             blk.25.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  234:             blk.25.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  235:          blk.26.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  236:           blk.26.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  237:           blk.26.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  238:             blk.26.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  239:           blk.26.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  240:             blk.26.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  241:        blk.26.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  242:             blk.26.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  243:             blk.26.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  244:          blk.27.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  245:           blk.27.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  246:           blk.27.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  247:             blk.27.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  248:           blk.27.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  249:             blk.27.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  250:        blk.27.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  251:             blk.27.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  252:             blk.27.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  253:          blk.28.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  254:           blk.28.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  255:           blk.28.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  256:             blk.28.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  257:           blk.28.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  258:             blk.28.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  259:        blk.28.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  260:             blk.28.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  261:             blk.28.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  262:          blk.29.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  263:           blk.29.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  264:           blk.29.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  265:             blk.29.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  266:           blk.29.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  267:             blk.29.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  268:        blk.29.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  269:             blk.29.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  270:             blk.29.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  271:           blk.30.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  272:             blk.30.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  273:             blk.30.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  274:        blk.30.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  275:             blk.30.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  276:             blk.30.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  277:                    output.weight q6_K     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor  278:          blk.30.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  279:           blk.30.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  280:           blk.30.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  281:          blk.31.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  282:           blk.31.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  283:           blk.31.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  284:             blk.31.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  285:           blk.31.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  286:             blk.31.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  287:        blk.31.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  288:             blk.31.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  289:             blk.31.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  290:          blk.32.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  291:           blk.32.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  292:           blk.32.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  293:             blk.32.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  294:           blk.32.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  295:             blk.32.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  296:        blk.32.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  297:             blk.32.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  298:             blk.32.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  299:          blk.33.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  300:           blk.33.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  301:           blk.33.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  302:             blk.33.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  303:           blk.33.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  304:             blk.33.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  305:        blk.33.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  306:             blk.33.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  307:             blk.33.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  308:          blk.34.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  309:           blk.34.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  310:           blk.34.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  311:             blk.34.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  312:           blk.34.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  313:             blk.34.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  314:        blk.34.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  315:             blk.34.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  316:             blk.34.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  317:          blk.35.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  318:           blk.35.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  319:           blk.35.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  320:             blk.35.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  321:           blk.35.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  322:             blk.35.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  323:        blk.35.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  324:             blk.35.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  325:             blk.35.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  326:          blk.36.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  327:           blk.36.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  328:           blk.36.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  329:             blk.36.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  330:           blk.36.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  331:             blk.36.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  332:        blk.36.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  333:             blk.36.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  334:             blk.36.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  335:          blk.37.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  336:           blk.37.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  337:           blk.37.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  338:             blk.37.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  339:           blk.37.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  340:             blk.37.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  341:        blk.37.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  342:             blk.37.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  343:             blk.37.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  344:          blk.38.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  345:           blk.38.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  346:           blk.38.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  347:             blk.38.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  348:           blk.38.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  349:             blk.38.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  350:        blk.38.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  351:             blk.38.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  352:             blk.38.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  353:          blk.39.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  354:           blk.39.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  355:           blk.39.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  356:             blk.39.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  357:           blk.39.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  358:             blk.39.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  359:        blk.39.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  360:             blk.39.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  361:             blk.39.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  362:               output_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - kv  18:               general.quantization_version u32     
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q6_K:  282 tensors
llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 4096
llm_load_print_meta: n_embd         = 5120
llm_load_print_meta: n_head         = 40
llm_load_print_meta: n_head_kv      = 40
llm_load_print_meta: n_layer        = 40
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 13824
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 13B
llm_load_print_meta: model ftype    = mostly Q6_K
llm_load_print_meta: model size     = 13.02 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.12 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  128.29 MB (+ 3200.00 MB per state)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 13256 MB
warning: failed to mlock 134402048-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MLOCK ('ulimit -l' as root).
....................................................................................................
llama_new_context_with_model: kv self size  = 3200.00 MB
llama_new_context_with_model: compute buffer total size =  351.47 MB
llama_new_context_with_model: VRAM scratch buffer: 350.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'llama2', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': '<s>[INST] ', 'PreInput': None, 'PreResponse': '[/INST]', 'terminate_response': ['[INST]', '</s>'], 'chat_sep': ' ', 'chat_turn_sep': ' </s>', 'humanstr': '[INST]', 'botstr': '[/INST]', 'generates_leading_space': False, 'system_prompt': ''}, 'visible_models': None, 'h2ogpt_key': None, 'load_8bit': False, 'load_4bit': False, 'low_bit_mode': 1, 'load_half': True, 'load_gptq': '', 'load_exllama': False, 'use_safetensors': False, 'revision': None, 'use_gpu_id': True, 'gpu_id': 0, 'compile_model': True, 'use_cache': None, 'llamacpp_dict': {'n_gpu_layers': 100, 'use_mlock': True, 'n_batch': 1024, 'n_gqa': 0, 'model_path_llama': 'https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q6_K.gguf', 'model_name_gptj': 'ggml-gpt4all-j-v1.3-groovy.bin', 'model_name_gpt4all_llama': 'ggml-wizardLM-7B.q4_2.bin', 'model_name_exllama_if_no_config': 'TheBloke/Nous-Hermes-Llama2-GPTQ'}, 'model_path_llama': 'https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q6_K.gguf', 'model_name_gptj': 'ggml-gpt4all-j-v1.3-groovy.bin', 'model_name_gpt4all_llama': 'ggml-wizardLM-7B.q4_2.bin', 'model_name_exllama_if_no_config': 'TheBloke/Nous-Hermes-Llama2-GPTQ'}
load INSTRUCTOR_Transformer
max_seq_length  512
Running on local URL:  http://0.0.0.0:7861

To create a public link, set `share=True` in `launch()`.
Started Gradio Server and/or GUI: server_name: 0.0.0.0 port: None

Memory usage just after startup is about 15GB:

jon@pseudotensor:~/h2ogpt$ nvidia-smi     
Sat Oct  7 13:05:51 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti      On | 00000000:01:00.0 Off |                  Off |
| 31%   57C    P2              114W / 480W|  16611MiB / 24564MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080         On | 00000000:03:00.0 Off |                  N/A |
| 40%   47C    P8                7W / 215W|     10MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1615      G   /usr/lib/xorg/Xorg                          156MiB |
|    0   N/A  N/A      2258      G   /usr/lib/xorg/Xorg                         1190MiB |
|    0   N/A  N/A      2391      G   /usr/bin/gnome-shell                        145MiB |
|    0   N/A  N/A      5455      G   /usr/bin/nvidia-settings                      0MiB |
|    0   N/A  N/A      5940      G   ...2605455,14348786443269456662,262144      301MiB |
|    0   N/A  N/A      6976      G   gnome-control-center                          4MiB |
|    0   N/A  N/A      8291      G   ...ures=SpareRendererForSitePerProcess       49MiB |
|    0   N/A  N/A     37837      C   python                                    14738MiB |
|    1   N/A  N/A      1615      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      2258      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
pseudotensor commented 9 months ago

If you "revert to" (i.e. checkout) https://github.com/h2oai/h2ogpt/commit/cb09dab183b36c1c49d21a3efc5ad7dc427278bc that is just before c38c19e507120eee8b06a2cfe4042c03f5610735 was added, which is the commit I said is the likely issue.

However, if you can easily reproduce the problem, it is better to isolate exactly which commit it is. This is easy, just do:

git checkout main                 # ensure you are at the tip of main
git bisect start
# run your command and confirm main is still bad
git bisect bad                    # mark the current (main) commit as bad
# mark the last known-good commit as good; git then checks out a commit roughly halfway in between
git bisect good cb09dab183b36c1c49d21a3efc5ad7dc427278bc
# run your command on the checked-out commit, then mark the result:
#   git bisect good   if it works,   git bisect bad   if it fails
# repeat until git reports the first bad commit (git bisect reset returns you to main when done)
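
If the failure is quick to reproduce, the marking can also be automated with git bisect run; check.sh here is a hypothetical wrapper (not part of h2ogpt) that runs the startup command and exits 0 when it comes up cleanly and 1 when it aborts (scripts must not exit with codes above 127, or the bisect stops):

git bisect start HEAD cb09dab183b36c1c49d21a3efc5ad7dc427278bc   # bad commit first, then good
git bisect run ./check.sh   # git marks each candidate commit from the script's exit code
git bisect reset            # return to where you started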

Thanks

slavag commented 9 months ago

@pseudotensor will try to isolate, but when I say revert, I mean I did git reset --hard <hash>. Thanks

slavag commented 9 months ago

@pseudotensor this is what I got as the bad commit:

c38c19e507120eee8b06a2cfe4042c03f5610735 is the first bad commit
commit c38c19e507120eee8b06a2cfe4042c03f5610735
Author: Jonathan C. McKinney <pseudotensor@gmail.com>
Date:   Fri Oct 6 13:58:17 2023 -0700

    Fixes #915

 src/gpt4all_llm.py | 2 +-
 src/utils.py       | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)
pseudotensor commented 9 months ago

@slavag Thanks, yes, that's the exact commit I mentioned at first. If you look at the code, I don't understand why passing --max_seq_len=2048 wouldn't take it back to exactly the old behavior.

I confirmed manually, with a breakpoint at that point in the code, that passing --max_seq_len=2048 results in model_max_length=2048, which was the default before.

slavag commented 9 months ago

@pseudotensor Well, I have no idea, but changing to 2048 also fails; you can see llm_load_print_meta: n_ctx = 2048 below. Execution command: (python311) [ec2-user@ip-10-0-204-192 h2ogpt]$ python generate.py --base_model=llama --prompt_type=llama2 --model_path_llama=/mnt/AI/models/TheBloke/Llama-2-13B-chat-GGUF/llama-2-13b-chat.Q6_K.gguf --allow_upload_to_user_data=False --langchain_modes="[ZenDeskTicketsWithDocs, MyData]" --langchain_mode=ZenDeskTicketsWithDocs --langchain_mode_types="{'ZenDeskTicketsWithDocs':'shared'}" --visible_side_bar=False --visible_doc_selection_tab=False --h2ocolors=False --use_llm_if_no_docs=False --max_seq_len=2048 --top_k_docs=-1 --temperature=0 --top_p=0.01 --top_k=2

Using Model llama
load INSTRUCTOR_Transformer
max_seq_length  512
Starting get_model: llama 
/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A10G, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /mnt/AI/models/TheBloke/Llama-2-13B-chat-GGUF/llama-2-13b-chat.Q6_K.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q6_K     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    8:              blk.0.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    9:              blk.0.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   10:           blk.1.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   11:            blk.1.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   12:            blk.1.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   13:              blk.1.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   16:         blk.1.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   17:              blk.1.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   18:              blk.1.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   19:          blk.10.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   20:           blk.10.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   21:           blk.10.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   22:             blk.10.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   23:           blk.10.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   24:             blk.10.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   25:        blk.10.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   26:             blk.10.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   27:             blk.10.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   28:          blk.11.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   29:           blk.11.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   30:           blk.11.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   31:             blk.11.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   32:           blk.11.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   33:             blk.11.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   34:        blk.11.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   35:             blk.11.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   36:             blk.11.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   37:          blk.12.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   38:           blk.12.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   39:           blk.12.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   40:             blk.12.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   41:           blk.12.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   42:             blk.12.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   43:        blk.12.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   44:             blk.12.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   45:             blk.12.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   46:          blk.13.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   47:           blk.13.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   48:           blk.13.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   49:             blk.13.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   50:           blk.13.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   51:             blk.13.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   52:        blk.13.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   53:             blk.13.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   54:             blk.13.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   55:          blk.14.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   56:           blk.14.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   57:           blk.14.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   58:             blk.14.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   59:           blk.14.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   60:             blk.14.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   61:        blk.14.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   62:             blk.14.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   63:             blk.14.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   64:             blk.15.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   65:             blk.15.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   66:           blk.2.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   67:            blk.2.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   68:            blk.2.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   69:              blk.2.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   70:            blk.2.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   71:              blk.2.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   72:         blk.2.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   73:              blk.2.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   74:              blk.2.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   75:           blk.3.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   76:            blk.3.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   77:            blk.3.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   78:              blk.3.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   79:            blk.3.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   80:              blk.3.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   81:         blk.3.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   82:              blk.3.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   83:              blk.3.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   84:           blk.4.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   85:            blk.4.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   86:            blk.4.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   87:              blk.4.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   88:            blk.4.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   89:              blk.4.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   90:         blk.4.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   91:              blk.4.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   92:              blk.4.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   93:           blk.5.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   94:            blk.5.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor   95:            blk.5.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   96:              blk.5.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor   97:            blk.5.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor   98:              blk.5.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor   99:         blk.5.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  100:              blk.5.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  101:              blk.5.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  102:           blk.6.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  103:            blk.6.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  104:            blk.6.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  105:              blk.6.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  106:            blk.6.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  107:              blk.6.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  108:         blk.6.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  109:              blk.6.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  110:              blk.6.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  111:           blk.7.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  112:            blk.7.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  113:            blk.7.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  114:              blk.7.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  115:            blk.7.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  116:              blk.7.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  117:         blk.7.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  118:              blk.7.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  119:              blk.7.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  120:           blk.8.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  121:            blk.8.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  122:            blk.8.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  123:              blk.8.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  124:            blk.8.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  125:              blk.8.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  126:         blk.8.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  127:              blk.8.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  128:              blk.8.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  129:           blk.9.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  130:            blk.9.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  131:            blk.9.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  132:              blk.9.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  133:            blk.9.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  134:              blk.9.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  135:         blk.9.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  136:              blk.9.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  137:              blk.9.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  138:          blk.15.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  139:           blk.15.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  140:           blk.15.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  141:             blk.15.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  142:           blk.15.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  143:        blk.15.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  144:             blk.15.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  145:          blk.16.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  146:           blk.16.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  147:           blk.16.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  148:             blk.16.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  149:           blk.16.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  150:             blk.16.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  151:        blk.16.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  152:             blk.16.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  153:             blk.16.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  154:          blk.17.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  155:           blk.17.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  156:           blk.17.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  157:             blk.17.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  158:           blk.17.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  159:             blk.17.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  160:        blk.17.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  161:             blk.17.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  162:             blk.17.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  163:          blk.18.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  164:           blk.18.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  165:           blk.18.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  166:             blk.18.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  167:           blk.18.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  168:             blk.18.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  169:        blk.18.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  170:             blk.18.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  171:             blk.18.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  172:          blk.19.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  173:           blk.19.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  174:           blk.19.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  175:             blk.19.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  176:           blk.19.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  177:             blk.19.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  178:        blk.19.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  179:             blk.19.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  180:             blk.19.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  181:          blk.20.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  182:           blk.20.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  183:           blk.20.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  184:             blk.20.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  185:           blk.20.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  186:             blk.20.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  187:        blk.20.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  188:             blk.20.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  189:             blk.20.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  190:          blk.21.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  191:           blk.21.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  192:           blk.21.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  193:             blk.21.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  194:           blk.21.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  195:             blk.21.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  196:        blk.21.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  197:             blk.21.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  198:             blk.21.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  199:          blk.22.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  200:           blk.22.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  201:           blk.22.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  202:             blk.22.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  203:           blk.22.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  204:             blk.22.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  205:        blk.22.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  206:             blk.22.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  207:             blk.22.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  208:          blk.23.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  209:           blk.23.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  210:           blk.23.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  211:             blk.23.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  212:           blk.23.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  213:             blk.23.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  214:        blk.23.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  215:             blk.23.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  216:             blk.23.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  217:          blk.24.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  218:           blk.24.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  219:           blk.24.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  220:             blk.24.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  221:           blk.24.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  222:             blk.24.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  223:        blk.24.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  224:             blk.24.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  225:             blk.24.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  226:          blk.25.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  227:           blk.25.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  228:           blk.25.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  229:             blk.25.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  230:           blk.25.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  231:             blk.25.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  232:        blk.25.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  233:             blk.25.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  234:             blk.25.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  235:          blk.26.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  236:           blk.26.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  237:           blk.26.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  238:             blk.26.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  239:           blk.26.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  240:             blk.26.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  241:        blk.26.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  242:             blk.26.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  243:             blk.26.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  244:          blk.27.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  245:           blk.27.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  246:           blk.27.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  247:             blk.27.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  248:           blk.27.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  249:             blk.27.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  250:        blk.27.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  251:             blk.27.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  252:             blk.27.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  253:          blk.28.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  254:           blk.28.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  255:           blk.28.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  256:             blk.28.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  257:           blk.28.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  258:             blk.28.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  259:        blk.28.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  260:             blk.28.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  261:             blk.28.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  262:          blk.29.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  263:           blk.29.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  264:           blk.29.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  265:             blk.29.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  266:           blk.29.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  267:             blk.29.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  268:        blk.29.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  269:             blk.29.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  270:             blk.29.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  271:           blk.30.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  272:             blk.30.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  273:             blk.30.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  274:        blk.30.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  275:             blk.30.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  276:             blk.30.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  277:                    output.weight q6_K     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor  278:          blk.30.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  279:           blk.30.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  280:           blk.30.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  281:          blk.31.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  282:           blk.31.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  283:           blk.31.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  284:             blk.31.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  285:           blk.31.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  286:             blk.31.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  287:        blk.31.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  288:             blk.31.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  289:             blk.31.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  290:          blk.32.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  291:           blk.32.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  292:           blk.32.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  293:             blk.32.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  294:           blk.32.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  295:             blk.32.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  296:        blk.32.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  297:             blk.32.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  298:             blk.32.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  299:          blk.33.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  300:           blk.33.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  301:           blk.33.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  302:             blk.33.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  303:           blk.33.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  304:             blk.33.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  305:        blk.33.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  306:             blk.33.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  307:             blk.33.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  308:          blk.34.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  309:           blk.34.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  310:           blk.34.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  311:             blk.34.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  312:           blk.34.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  313:             blk.34.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  314:        blk.34.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  315:             blk.34.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  316:             blk.34.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  317:          blk.35.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  318:           blk.35.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  319:           blk.35.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  320:             blk.35.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  321:           blk.35.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  322:             blk.35.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  323:        blk.35.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  324:             blk.35.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  325:             blk.35.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  326:          blk.36.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  327:           blk.36.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  328:           blk.36.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  329:             blk.36.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  330:           blk.36.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  331:             blk.36.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  332:        blk.36.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  333:             blk.36.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  334:             blk.36.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  335:          blk.37.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  336:           blk.37.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  337:           blk.37.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  338:             blk.37.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  339:           blk.37.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  340:             blk.37.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  341:        blk.37.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  342:             blk.37.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  343:             blk.37.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  344:          blk.38.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  345:           blk.38.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  346:           blk.38.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  347:             blk.38.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  348:           blk.38.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  349:             blk.38.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  350:        blk.38.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  351:             blk.38.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  352:             blk.38.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  353:          blk.39.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  354:           blk.39.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor  355:           blk.39.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  356:             blk.39.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor  357:           blk.39.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor  358:             blk.39.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  359:        blk.39.attn_output.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  360:             blk.39.attn_q.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  361:             blk.39.attn_v.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor  362:               output_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - kv  18:               general.quantization_version u32     
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q6_K:  282 tensors
llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 2048
llm_load_print_meta: n_embd         = 5120
llm_load_print_meta: n_head         = 40
llm_load_print_meta: n_head_kv      = 40
llm_load_print_meta: n_layer        = 40
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 13824
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 13B
llm_load_print_meta: model ftype    = mostly Q6_K
llm_load_print_meta: model size     = 13.02 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.12 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  128.29 MB (+ 1600.00 MB per state)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 11656 MB
....................................................................................................
llama_new_context_with_model: kv self size  = 1600.00 MB
llama_new_context_with_model: compute buffer total size =  191.47 MB
llama_new_context_with_model: VRAM scratch buffer: 190.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'llama2', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': '<s>[INST] ', 'PreInput': None, 'PreResponse': '[/INST]', 'terminate_response': ['[INST]', '</s>'], 'chat_sep': ' ', 'chat_turn_sep': ' </s>', 'humanstr': '[INST]', 'botstr': '[/INST]', 'generates_leading_space': False, 'system_prompt': ''}, 'visible_models': None, 'h2ogpt_key': None, 'load_8bit': False, 'load_4bit': False, 'low_bit_mode': 1, 'load_half': True, 'load_gptq': '', 'load_exllama': False, 'use_safetensors': False, 'revision': None, 'use_gpu_id': True, 'gpu_id': 0, 'compile_model': True, 'use_cache': None, 'llamacpp_dict': {'n_gpu_layers': 100, 'use_mlock': True, 'n_batch': 1024, 'n_gqa': 0, 'model_path_llama': '/mnt/AI/models/TheBloke/Llama-2-13B-chat-GGUF/llama-2-13b-chat.Q6_K.gguf', 'model_name_gptj': 'ggml-gpt4all-j-v1.3-groovy.bin', 'model_name_gpt4all_llama': 'ggml-wizardLM-7B.q4_2.bin', 'model_name_exllama_if_no_config': 'TheBloke/Nous-Hermes-Llama2-GPTQ'}, 'model_path_llama': '/mnt/AI/models/TheBloke/Llama-2-13B-chat-GGUF/llama-2-13b-chat.Q6_K.gguf', 'model_name_gptj': 'ggml-gpt4all-j-v1.3-groovy.bin', 'model_name_gpt4all_llama': 'ggml-wizardLM-7B.q4_2.bin', 'model_name_exllama_if_no_config': 'TheBloke/Nous-Hermes-Llama2-GPTQ'}
load INSTRUCTOR_Transformer
max_seq_length  512
/mnt/AI/h2ogpt/src/gradio_runner.py:732: GradioUnusedKwargWarning: You have unused kwarg parameters in Dropdown, please remove them: {'filterable': False}
  langchain_agents = gr.Dropdown(
/mnt/AI/h2ogpt/src/gradio_runner.py:836: GradioUnusedKwargWarning: You have unused kwarg parameters in Dropdown, please remove them: {'filterable': False}
  visible_models = gr.Dropdown(kwargs['all_models'],
Running on local URL:  http://0.0.0.0:7860

And it fails, even though there is plenty of available VRAM.

ggml_allocr_alloc: not enough space in the buffer (needed 167772160, largest block available 136312864)
GGML_ASSERT: /tmp/pip-install-lq9gmmv7/llama-cpp-python_511fca8a040049ac9ab4298ec532185f/vendor/llama.cpp/ggml-alloc.c:173: !"not enough space in the buffer"
Fatal Python error: Aborted

Current thread 0x00007ffa5ef82700 (most recent call first):
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/llama_cpp/llama_cpp.py", line 808 in llama_eval
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/llama_cpp/llama.py", line 500 in eval
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/llama_cpp/llama.py", line 777 in generate
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/llama_cpp/llama.py", line 957 in _create_completion
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/llms/llamacpp.py", line 288 in _stream
  File "/mnt/AI/h2ogpt/src/gpt4all_llm.py", line 399 in _stream
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/llms/base.py", line 341 in stream
  File "/mnt/AI/h2ogpt/src/gpt4all_llm.py", line 365 in _call
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/llms/base.py", line 961 in _generate
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/llms/base.py", line 475 in _generate_helper
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/llms/base.py", line 582 in generate
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/llms/base.py", line 451 in generate_prompt
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/chains/llm.py", line 102 in generate
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/chains/llm.py", line 92 in _call
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/chains/base.py", line 252 in __call__
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/chains/llm.py", line 252 in predict
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/chains/combine_documents/stuff.py", line 165 in combine_docs
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/chains/combine_documents/base.py", line 106 in _call
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/langchain/chains/base.py", line 252 in __call__
  File "/mnt/AI/h2ogpt/src/utils.py", line 412 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007ffa5f983700 (most recent call first):
  File "/mnt/miniconda3/envs/python311/lib/python3.11/concurrent/futures/thread.py", line 81 in _worker
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 975 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007ffa60db7700 (most recent call first):
  File "/mnt/AI/h2ogpt/src/utils_langchain.py", line 67 in __next__
  File "/mnt/AI/h2ogpt/src/gpt_langchain.py", line 3808 in _run_qa_db
  File "/mnt/AI/h2ogpt/src/gen.py", line 2382 in evaluate
  File "/mnt/AI/h2ogpt/src/gradio_runner.py", line 3015 in get_response
  File "/mnt/AI/h2ogpt/src/gradio_runner.py", line 3066 in bot
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/gradio/utils.py", line 695 in gen_wrapper
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/gradio/utils.py", line 326 in run_sync_iterator_async
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007ffab2d5a700 (most recent call first):
  File "/mnt/miniconda3/envs/python311/lib/python3.11/asyncio/runners.py", line 118 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/asyncio/runners.py", line 190 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/uvicorn/server.py", line 61 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 975 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007ffab375b700 (most recent call first):
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 324 in wait
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 622 in wait
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/apscheduler/schedulers/blocking.py", line 30 in _main_loop
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 975 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007ffaca3b0700 (most recent call first):
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 324 in wait
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 622 in wait
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007ffacc654700 (most recent call first):
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 324 in wait
  File "/mnt/miniconda3/envs/python311/lib/python3.11/queue.py", line 180 in get
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/posthog/consumer.py", line 104 in next
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/posthog/consumer.py", line 73 in upload
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/posthog/consumer.py", line 62 in run
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/mnt/miniconda3/envs/python311/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x00007ffbdaaa5740 (most recent call first):
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/gradio/blocks.py", line 2202 in block_thread
  File "/mnt/AI/h2ogpt/src/gradio_runner.py", line 4183 in go_gradio
  File "/mnt/AI/h2ogpt/src/gen.py", line 1271 in main
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/fire/core.py", line 691 in _CallAndUpdateTrace
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/fire/core.py", line 475 in _Fire
  File "/mnt/miniconda3/envs/python311/lib/python3.11/site-packages/fire/core.py", line 141 in Fire
  File "/mnt/AI/h2ogpt/src/utils.py", line 64 in H2O_Fire
  File "/mnt/AI/h2ogpt/generate.py", line 12 in entrypoint_main
  File "/mnt/AI/h2ogpt/generate.py", line 16 in <module>

Extension modules: simplejson._speedups, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslib, pandas._libs.ops, numexpr.interpreter, pyarrow._compute, pandas._libs.arrays, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, lz4._version, lz4.frame._frame, psutil._psutil_linux, psutil._psutil_posix, matplotlib._c_internal_utils, PIL._imaging, matplotlib._path, kiwisolver._cext, matplotlib._image, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, yaml._yaml, sentencepiece._sentencepiece, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, sqlalchemy.cyextension.collections, sqlalchemy.cyextension.immutabledict, sqlalchemy.cyextension.processors, sqlalchemy.cyextension.resultproxy, sqlalchemy.cyextension.util, greenlet._greenlet, scipy._lib._ccallback_c, numpy.linalg.lapack_lite, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, 
scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._statlib, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, regex._regex, sklearn.__check_build._check_build, sklearn.utils.murmurhash, sklearn.utils._isfinite, sklearn.utils._openmp_helpers, sklearn.utils._vector_sentinel, sklearn.feature_extraction._hashing_fast, sklearn.utils._logistic_sigmoid, sklearn.utils.sparsefuncs_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.utils._cython_blas, sklearn.svm._libsvm, sklearn.svm._liblinear, sklearn.svm._libsvm_sparse, sklearn.utils._random, sklearn.utils._seq_dataset, sklearn.utils.arrayfuncs, sklearn.utils._typedefs, sklearn.utils._readonly_array_wrapper, sklearn.metrics._dist_metrics, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_fast, sklearn.linear_model._cd_fast, sklearn._loss._loss, sklearn.utils._weight_vector, sklearn.linear_model._sgd_fast, sklearn.linear_model._sag_fast, sklearn.datasets._svmlight_format_fast, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, google._upb._message, zstandard.backend_c, websockets.speedups, ujson, markupsafe._speedups, PIL._webp, uvloop.loop, httptools.parser.parser, httptools.parser.url_parser (total: 262)
Aborted
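
For reference, converting the allocator numbers above (arithmetic only, not from the log): the failed request is exactly 160 MiB while the largest free block is about 130 MiB, which suggests the shortfall is inside the ~190 MB compute/scratch buffer reported at load time (compute buffer total size = 191.47 MB), not in the model weights or KV cache.

# convert the ggml_allocr_alloc numbers to MiB
needed = 167_772_160     # bytes requested
largest = 136_312_864    # largest free block reported
MiB = 1024 * 1024
print(needed / MiB, largest / MiB)   # -> 160.0 vs ~130.0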
pseudotensor commented 9 months ago

Hi @slavag, I'd like to solve the problem, but since it doesn't fail for me it is hard to debug. Are you able to edit the code to understand why --max_seq_len=2048 is not enough? That is, I presume the following leads to things working, since you said that was the bad commit:

git checkout main
git revert c38c19e507120eee8b06a2cfe4042c03f5610735
# run your command

If this does work, then try to understand how FakeTokenizer is being set differently when you pass --max_seq_len=2048. I don't understand how that commit can lead to the failure if 2048 is used. Thanks for the help!
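
One way to separate h2ogpt from llama.cpp here would be a minimal standalone reproduction against llama-cpp-python, using the same GGUF, context size, and offload settings as the run above (a sketch only; the prompt and max_tokens are placeholders):

from llama_cpp import Llama

llm = Llama(
    model_path="/mnt/AI/models/TheBloke/Llama-2-13B-chat-GGUF/llama-2-13b-chat.Q6_K.gguf",
    n_ctx=2048,        # same as --max_seq_len=2048
    n_gpu_layers=100,  # matches llamacpp_dict in the startup log
    n_batch=1024,      # matches llamacpp_dict in the startup log
)
# a long prompt (approaching n_ctx) better mimics the stuffed-document query that failed
prompt = "Hello " * 500
out = llm(prompt, max_tokens=64)
print(out["choices"][0]["text"])

If this standalone run hits the same ggml_allocr_alloc assert, the problem is in the llama-cpp-python / llama.cpp build; if it doesn't, the difference is in how h2ogpt constructs the model or tokenizer.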

slavag commented 9 months ago

@pseudotensor I also tried to figure it out and created a new, clean environment, and now it's working fine with both 4096 and 2048. I don't know what caused the issue, but creating an entirely new, clean Python env solved it. Thanks for your help

pseudotensor commented 9 months ago

Ok, if it comes back let me know...