Strange characters - Githubissues

I tried converting the model to both 4 bits and 2 bits, but in both cases inference outputs strange characters:

(base) ✘-1 desktop:~/dev/projects/ai/pyllama [main|✔]> python quant_infer.py --wbits 2 --load ~/data/ai/models/llama/pyllama-7B2b.pt --text "the meaning of life is" --max_length 32 --cuda cuda:0
⌛️ Loading model from /home/nico/data/ai/models/llama/pyllama-7B2b.pt...
✅ Model from /home/nico/data/ai/models/llama/pyllama-7B2b.pt is loaded successfully.
********************************************************************************
🦙: the meaning of life is a aapsamama� Achami0i�am Tam-fz ofz-� Spatchz�

I followed the instructions in README.md and ran the quantization this way:

(base) ✔ desktop:~/dev/projects/ai/pyllama [main|✔]> python -m llama.llama_quant decapoda-research/llama-7b-hf c4 --wbits 2 --save ~/data/ai/models/llama/pyllama-7B2b.pt
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [06:16<00:00, 11.42s/it]
Found cached dataset json (/home/nico/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
Found cached dataset json (/home/nico/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)

Quantize layer: 0 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 1 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 2 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 3 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 4 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 5 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 6 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 7 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 8 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 9 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 10 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 11 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 12 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 13 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 14 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 15 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 16 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 17 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 18 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 19 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 20 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 21 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 22 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 23 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 24 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 25 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 26 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 27 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 28 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 29 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 30 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,
Quantize layer: 31 ,self_attn.q_proj,self_attn.k_proj,self_attn.v_proj,self_attn.o_proj,mlp.gate_proj,mlp.down_proj,mlp.up_proj,model.layers.0.self_attn.q_proj
model.layers.0.self_attn.k_proj
model.layers.0.self_attn.v_proj
model.layers.0.self_attn.o_proj
model.layers.0.mlp.gate_proj
model.layers.0.mlp.down_proj
model.layers.0.mlp.up_proj
model.layers.1.self_attn.q_proj
model.layers.1.self_attn.k_proj
model.layers.1.self_attn.v_proj
model.layers.1.self_attn.o_proj
model.layers.1.mlp.gate_proj
model.layers.1.mlp.down_proj
model.layers.1.mlp.up_proj
model.layers.2.self_attn.q_proj
model.layers.2.self_attn.k_proj
model.layers.2.self_attn.v_proj
model.layers.2.self_attn.o_proj
model.layers.2.mlp.gate_proj
model.layers.2.mlp.down_proj
model.layers.2.mlp.up_proj
model.layers.3.self_attn.q_proj
model.layers.3.self_attn.k_proj
model.layers.3.self_attn.v_proj
model.layers.3.self_attn.o_proj
model.layers.3.mlp.gate_proj
model.layers.3.mlp.down_proj
model.layers.3.mlp.up_proj
model.layers.4.self_attn.q_proj
model.layers.4.self_attn.k_proj
model.layers.4.self_attn.v_proj
model.layers.4.self_attn.o_proj
model.layers.4.mlp.gate_proj
model.layers.4.mlp.down_proj
model.layers.4.mlp.up_proj
model.layers.5.self_attn.q_proj
model.layers.5.self_attn.k_proj
model.layers.5.self_attn.v_proj
model.layers.5.self_attn.o_proj
model.layers.5.mlp.gate_proj
model.layers.5.mlp.down_proj
model.layers.5.mlp.up_proj
model.layers.6.self_attn.q_proj
model.layers.6.self_attn.k_proj
model.layers.6.self_attn.v_proj
model.layers.6.self_attn.o_proj
model.layers.6.mlp.gate_proj
model.layers.6.mlp.down_proj
model.layers.6.mlp.up_proj
model.layers.7.self_attn.q_proj
model.layers.7.self_attn.k_proj
model.layers.7.self_attn.v_proj
model.layers.7.self_attn.o_proj
model.layers.7.mlp.gate_proj
model.layers.7.mlp.down_proj
model.layers.7.mlp.up_proj
model.layers.8.self_attn.q_proj
model.layers.8.self_attn.k_proj
model.layers.8.self_attn.v_proj
model.layers.8.self_attn.o_proj
model.layers.8.mlp.gate_proj
model.layers.8.mlp.down_proj
model.layers.8.mlp.up_proj
model.layers.9.self_attn.q_proj
model.layers.9.self_attn.k_proj
model.layers.9.self_attn.v_proj
model.layers.9.self_attn.o_proj
model.layers.9.mlp.gate_proj
model.layers.9.mlp.down_proj
model.layers.9.mlp.up_proj
model.layers.10.self_attn.q_proj
model.layers.10.self_attn.k_proj
model.layers.10.self_attn.v_proj
model.layers.10.self_attn.o_proj
model.layers.10.mlp.gate_proj
model.layers.10.mlp.down_proj
model.layers.10.mlp.up_proj
model.layers.11.self_attn.q_proj
model.layers.11.self_attn.k_proj
model.layers.11.self_attn.v_proj
model.layers.11.self_attn.o_proj
model.layers.11.mlp.gate_proj
model.layers.11.mlp.down_proj
model.layers.11.mlp.up_proj
model.layers.12.self_attn.q_proj
model.layers.12.self_attn.k_proj
model.layers.12.self_attn.v_proj
model.layers.12.self_attn.o_proj
model.layers.12.mlp.gate_proj
model.layers.12.mlp.down_proj
model.layers.12.mlp.up_proj
model.layers.13.self_attn.q_proj
model.layers.13.self_attn.k_proj
model.layers.13.self_attn.v_proj
model.layers.13.self_attn.o_proj
model.layers.13.mlp.gate_proj
model.layers.13.mlp.down_proj
model.layers.13.mlp.up_proj
model.layers.14.self_attn.q_proj
model.layers.14.self_attn.k_proj
model.layers.14.self_attn.v_proj
model.layers.14.self_attn.o_proj
model.layers.14.mlp.gate_proj
model.layers.14.mlp.down_proj
model.layers.14.mlp.up_proj
model.layers.15.self_attn.q_proj
model.layers.15.self_attn.k_proj
model.layers.15.self_attn.v_proj
model.layers.15.self_attn.o_proj
model.layers.15.mlp.gate_proj
model.layers.15.mlp.down_proj
model.layers.15.mlp.up_proj
model.layers.16.self_attn.q_proj
model.layers.16.self_attn.k_proj
model.layers.16.self_attn.v_proj
model.layers.16.self_attn.o_proj
model.layers.16.mlp.gate_proj
model.layers.16.mlp.down_proj
model.layers.16.mlp.up_proj
model.layers.17.self_attn.q_proj
model.layers.17.self_attn.k_proj
model.layers.17.self_attn.v_proj
model.layers.17.self_attn.o_proj
model.layers.17.mlp.gate_proj
model.layers.17.mlp.down_proj
model.layers.17.mlp.up_proj
model.layers.18.self_attn.q_proj
model.layers.18.self_attn.k_proj
model.layers.18.self_attn.v_proj
model.layers.18.self_attn.o_proj
model.layers.18.mlp.gate_proj
model.layers.18.mlp.down_proj
model.layers.18.mlp.up_proj
model.layers.19.self_attn.q_proj
model.layers.19.self_attn.k_proj
model.layers.19.self_attn.v_proj
model.layers.19.self_attn.o_proj
model.layers.19.mlp.gate_proj
model.layers.19.mlp.down_proj
model.layers.19.mlp.up_proj
model.layers.20.self_attn.q_proj
model.layers.20.self_attn.k_proj
model.layers.20.self_attn.v_proj
model.layers.20.self_attn.o_proj
model.layers.20.mlp.gate_proj
model.layers.20.mlp.down_proj
model.layers.20.mlp.up_proj
model.layers.21.self_attn.q_proj
model.layers.21.self_attn.k_proj
model.layers.21.self_attn.v_proj
model.layers.21.self_attn.o_proj
model.layers.21.mlp.gate_proj
model.layers.21.mlp.down_proj
model.layers.21.mlp.up_proj
model.layers.22.self_attn.q_proj
model.layers.22.self_attn.k_proj
model.layers.22.self_attn.v_proj
model.layers.22.self_attn.o_proj
model.layers.22.mlp.gate_proj
model.layers.22.mlp.down_proj
model.layers.22.mlp.up_proj
model.layers.23.self_attn.q_proj
model.layers.23.self_attn.k_proj
model.layers.23.self_attn.v_proj
model.layers.23.self_attn.o_proj
model.layers.23.mlp.gate_proj
model.layers.23.mlp.down_proj
model.layers.23.mlp.up_proj
model.layers.24.self_attn.q_proj
model.layers.24.self_attn.k_proj
model.layers.24.self_attn.v_proj
model.layers.24.self_attn.o_proj
model.layers.24.mlp.gate_proj
model.layers.24.mlp.down_proj
model.layers.24.mlp.up_proj
model.layers.25.self_attn.q_proj
model.layers.25.self_attn.k_proj
model.layers.25.self_attn.v_proj
model.layers.25.self_attn.o_proj
model.layers.25.mlp.gate_proj
model.layers.25.mlp.down_proj
model.layers.25.mlp.up_proj
model.layers.26.self_attn.q_proj
model.layers.26.self_attn.k_proj
model.layers.26.self_attn.v_proj
model.layers.26.self_attn.o_proj
model.layers.26.mlp.gate_proj
model.layers.26.mlp.down_proj
model.layers.26.mlp.up_proj
model.layers.27.self_attn.q_proj
model.layers.27.self_attn.k_proj
model.layers.27.self_attn.v_proj
model.layers.27.self_attn.o_proj
model.layers.27.mlp.gate_proj
model.layers.27.mlp.down_proj
model.layers.27.mlp.up_proj
model.layers.28.self_attn.q_proj
model.layers.28.self_attn.k_proj
model.layers.28.self_attn.v_proj
model.layers.28.self_attn.o_proj
model.layers.28.mlp.gate_proj
model.layers.28.mlp.down_proj
model.layers.28.mlp.up_proj
model.layers.29.self_attn.q_proj
model.layers.29.self_attn.k_proj
model.layers.29.self_attn.v_proj
model.layers.29.self_attn.o_proj
model.layers.29.mlp.gate_proj
model.layers.29.mlp.down_proj
model.layers.29.mlp.up_proj
model.layers.30.self_attn.q_proj
model.layers.30.self_attn.k_proj
model.layers.30.self_attn.v_proj
model.layers.30.self_attn.o_proj
model.layers.30.mlp.gate_proj
model.layers.30.mlp.down_proj
model.layers.30.mlp.up_proj
model.layers.31.self_attn.q_proj
model.layers.31.self_attn.k_proj
model.layers.31.self_attn.v_proj
model.layers.31.self_attn.o_proj
model.layers.31.mlp.gate_proj
model.layers.31.mlp.down_proj
model.layers.31.mlp.up_proj
Found cached dataset wikitext (/home/nico/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Found cached dataset wikitext (/home/nico/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
wikitext2
Evaluating ...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/nico/anaconda3/lib/python3.10/runpy.py:196 in _run_module_as_main                          │
│                                                                                                  │
│   193 │   main_globals = sys.modules["__main__"].__dict__                                        │
│   194 │   if alter_argv:                                                                         │
│   195 │   │   sys.argv[0] = mod_spec.origin                                                      │
│ ❱ 196 │   return _run_code(code, main_globals, None,                                             │
│   197 │   │   │   │   │    "__main__", mod_spec)                                                 │
│   198                                                                                            │
│   199 def run_module(mod_name, init_globals=None,                                                │
│                                                                                                  │
│ /home/nico/anaconda3/lib/python3.10/runpy.py:86 in _run_code                                     │
│                                                                                                  │
│    83 │   │   │   │   │      __loader__ = loader,                                                │
│    84 │   │   │   │   │      __package__ = pkg_name,                                             │
│    85 │   │   │   │   │      __spec__ = mod_spec)                                                │
│ ❱  86 │   exec(code, run_globals)                                                                │
│    87 │   return run_globals                                                                     │
│    88                                                                                            │
│    89 def _run_module_code(code, init_globals=None,                                              │
│                                                                                                  │
│ /media/nico/data/projects/ai/pyllama/llama/llama_quant.py:477 in <module>                        │
│                                                                                                  │
│   474                                                                                            │
│   475                                                                                            │
│   476 if __name__ == "__main__":                                                                 │
│ ❱ 477 │   run()                                                                                  │
│   478                                                                                            │
│                                                                                                  │
│ /media/nico/data/projects/ai/pyllama/llama/llama_quant.py:473 in run                             │
│                                                                                                  │
│   470 │   │   │   │   dataset, seed=args.seed, model=args.model, seqlen=model.seqlen, tokenize   │
│   471 │   │   │   )                                                                              │
│   472 │   │   │   print(dataset)                                                                 │
│ ❱ 473 │   │   │   llama_eval(model, testloader, args, dev)                                       │
│   474                                                                                            │
│   475                                                                                            │
│   476 if __name__ == "__main__":                                                                 │
│                                                                                                  │
│ /home/nico/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py:115 in              │
│ decorate_context                                                                                 │
│                                                                                                  │
│   112 │   @functools.wraps(func)                                                                 │
│   113 │   def decorate_context(*args, **kwargs):                                                 │
│   114 │   │   with ctx_factory():                                                                │
│ ❱ 115 │   │   │   return func(*args, **kwargs)                                                   │
│   116 │                                                                                          │
│   117 │   return decorate_context                                                                │
│   118                                                                                            │
│                                                                                                  │
│ /media/nico/data/projects/ai/pyllama/llama/llama_quant.py:154 in llama_eval                      │
│                                                                                                  │
│   151 │   model.model.embed_tokens = model.model.embed_tokens.cpu()                              │
│   152 │   torch.cuda.empty_cache()                                                               │
│   153 │                                                                                          │
│ ❱ 154 │   outs = torch.zeros_like(inps)                                                          │
│   155 │   attention_mask = cache["attention_mask"]                                               │
│   156 │                                                                                          │
│   157 │   for i in range(len(layers)):                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 2.60 GiB (GPU 0; 5.78 GiB total capacity; 2.61 GiB already allocated; 2.33 GiB free; 2.65 GiB reserved in total by 
PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The evaluation step couldn't complete because the GPU ran out of memory, but the quantized model was saved successfully.
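As a rough sanity check on the figures in the error message, here is a minimal sketch that pulls the numbers out of the message and compares them. The helper `parse_oom` is a hypothetical name for illustration, not part of pyllama or PyTorch, and the regular expressions assume the exact message format shown above.

```python
import re

# The error text reported above, joined onto a single line.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 2.60 GiB (GPU 0; 5.78 GiB total capacity; "
    "2.61 GiB already allocated; 2.33 GiB free; 2.65 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Extract the GiB figures from an OOM message in the format shown above.

    Hypothetical helper: it assumes every figure is reported in GiB.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    return {
        name: float(re.search(pattern, message).group(1))
        for name, pattern in patterns.items()
    }


stats = parse_oom(OOM_MESSAGE)

# How much more memory the failed allocation needed than was free.
shortfall = stats["requested"] - stats["free"]

# Memory held by PyTorch's caching allocator but not used by live tensors.
cached_but_unused = stats["reserved"] - stats["allocated"]

print(f"short by about {shortfall:.2f} GiB")
print(f"reserved but not allocated: {cached_but_unused:.2f} GiB")
```

If I read these numbers correctly, the request exceeds the free memory by roughly 0.27 GiB, while reserved memory is only about 0.04 GiB above allocated memory. That suggests the `max_split_size_mb` hint in the message would probably not help here, since it targets the case where reserved memory is much larger than allocated memory.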

Does anyone have any advice?

juncongmoo / pyllama

Strange characters #82