Thanks for this awesome work. I was curious to try llama3-8B on my personal CPU, and the performance is quite impressive (nearly 2x llama.cpp for the same model size on the same hardware).
However, I was quite surprised by how repetitive the model's output is. For basically every prompt I tried, the model generates a few tokens to begin with, then keeps repeating the same sentence over and over again.
For example, this is the output using the i2_s quantization type:
(bitnet-cpp) C:\Users\ahouz\Desktop\aahouzi\BitNet>python run_inference.py -m models\Llama3-8B-1.58-100B-tokens\ggml-model-i2_s.gguf -p "Once upon a time, there was a girl who" -n 128 -temp 0 -t 18
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3947 (406a5036) with Clang 17.0.3 for x64
main: llama backend init
main: load the model and apply lora adapter, if any
................................................
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 3669.02 MiB
................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 16.16 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 18
system_info: n_threads = 18 (n_threads_batch = 18) / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 4294967295
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 1
Once upon a time, there was a girl who was very beautiful. She was so beautiful that she was called the most beautiful girl in the world. She was called the most beautiful girl in the world because she was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She
llama_perf_sampler_print: sampling time = 13.60 ms / 139 runs ( 0.10 ms per token, 10222.84 tokens per second)
llama_perf_context_print: load time = 1256.47 ms
llama_perf_context_print: prompt eval time = 514.62 ms / 11 tokens ( 46.78 ms per token, 21.38 tokens per second)
llama_perf_context_print: eval time = 6035.57 ms / 127 runs ( 47.52 ms per token, 21.04 tokens per second)
llama_perf_context_print: total time = 6591.15 ms / 138 tokens
The same issue occurs with the tl2 quantization type:
(bitnet-cpp) C:\Users\ahouz\Desktop\aahouzi\BitNet>python run_inference.py -m models\Llama3-8B-1.58-100B-tokens\ggml-model-tl2.gguf -p "Once upon a time, there was a girl who" -n 128 -temp 0 -t 18
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3947 (406a5036) with Clang 17.0.3 for x64
main: llama backend init
............................................
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 3.33 GiB (3.56 BPW)
llm_load_print_meta: general.name = Llama3-8B-1.58-100B-tokens
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 3405.69 MiB
............................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 16.16 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 18
system_info: n_threads = 18 (n_threads_batch = 18) / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 4294967295
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 1
Once upon a time, there was a girl who was very beautiful. She was so beautiful that she was called the most beautiful girl in the world. She was called the most beautiful girl in the world because she was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She
llama_perf_sampler_print: sampling time = 12.62 ms / 139 runs ( 0.09 ms per token, 11016.01 tokens per second)
llama_perf_context_print: load time = 1245.30 ms
llama_perf_context_print: prompt eval time = 646.28 ms / 11 tokens ( 58.75 ms per token, 17.02 tokens per second)
llama_perf_context_print: eval time = 7543.01 ms / 127 runs ( 59.39 ms per token, 16.84 tokens per second)
llama_perf_context_print: total time = 8228.56 ms / 138 tokens
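One thing worth noting from the logs: with `-temp 0` and `repeat_penalty = 1.000`, the sampler chain collapses to pure greedy decoding (`logits -> logit-bias -> penalties -> greedy`, with the penalties being no-ops), which is known to fall into exactly this kind of loop regardless of the backend. A minimal sketch of what a non-trivial `repeat_penalty` would do to the logits (llama.cpp-style penalty; the token IDs and logit values below are made up for illustration):

```python
def apply_repeat_penalty(logits, recent_tokens, penalty=1.1):
    """llama.cpp-style repetition penalty: for tokens seen in the
    recent window, divide positive logits and multiply negative
    logits by the penalty, making re-emission less likely."""
    out = list(logits)
    for t in set(recent_tokens):
        if out[t] > 0:
            out[t] /= penalty
        else:
            out[t] *= penalty
    return out

# Toy vocabulary of 4 tokens; token 2 was just generated and
# currently has the top logit, so greedy decoding would loop on it.
logits = [1.0, 0.5, 2.0, -1.0]
penalized = apply_repeat_penalty(logits, recent_tokens=[2, 3], penalty=1.5)
```

With the penalty applied, token 2's logit drops from 2.0 to about 1.33, so a different continuation can win. It would be interesting to know whether the loop persists with `repeat_penalty > 1` or a non-zero temperature, or whether this is specific to the i2_s/tl2 kernels.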
Type of issue
Repetitive / looping generation (same sentence repeated, as shown in the logs above)
CPU: Intel Core Ultra 7 155H
OS: Windows 11