microsoft / BitNet

Official inference framework for 1-bit LLMs
MIT License

Hallucination for Llama3-8B-1.58-100B-tokens model with both i2_s and tl2 quantization #12

Open aahouzi opened 19 hours ago

Type of issue

Model quality / hallucination: with greedy decoding (temp 0), both the i2_s and tl2 GGUF builds of Llama3-8B-1.58-100B-tokens collapse into the same repetition loop. Full transcripts for both quantizations follow.

(bitnet-cpp) C:\Users\ahouz\Desktop\aahouzi\BitNet>python run_inference.py -m models\Llama3-8B-1.58-100B-tokens\ggml-model-i2_s.gguf -p "Once upon a time, there was a girl who" -n 128 -temp 0 -t 18
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3947 (406a5036) with Clang 17.0.3 for x64
main: llama backend init
main: load the model and apply lora adapter, if any
................................................
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =  3669.02 MiB
................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 32
llama_new_context_with_model: n_ubatch   = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =    16.16 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 18

system_info: n_threads = 18 (n_threads_batch = 18) / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

sampler seed: 4294967295
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 1

Once upon a time, there was a girl who was very beautiful. She was so beautiful that she was called the most beautiful girl in the world. She was called the most beautiful girl in the world because she was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She

llama_perf_sampler_print:    sampling time =      13.60 ms /   139 runs   (    0.10 ms per token, 10222.84 tokens per second)
llama_perf_context_print:        load time =    1256.47 ms
llama_perf_context_print: prompt eval time =     514.62 ms /    11 tokens (   46.78 ms per token,    21.38 tokens per second)
llama_perf_context_print:        eval time =    6035.57 ms /   127 runs   (   47.52 ms per token,    21.04 tokens per second)
llama_perf_context_print:       total time =    6591.15 ms /   138 tokens
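
Aside on the sampler settings above: with -temp 0 the chain ("logits -> logit-bias -> penalties -> greedy") reduces to pure greedy decoding, and repeat_penalty = 1.000 leaves the repetition penalty disabled, so an exact loop like this is a known failure mode of greedy decoding on base models and is not by itself proof of a quantization bug. One way to separate the two is to re-run with sampling and the penalty enabled; a minimal sketch, assuming the build places the llama.cpp binary at build\bin\llama-cli (hypothetical path, adjust to your tree) and using only standard llama.cpp flags:

import subprocess

# Hypothetical path: BitNet's run_inference.py ultimately drives a compiled
# llama.cpp binary; point LLAMA_CLI at wherever your build put it.
LLAMA_CLI = r"build\bin\llama-cli"
MODEL = r"models\Llama3-8B-1.58-100B-tokens\ggml-model-i2_s.gguf"

# Same prompt as above, but with sampling on (--temp 0.7) and llama.cpp's
# repetition penalty enabled (--repeat-penalty 1.1). If the loop disappears,
# it is a greedy-decoding artifact rather than an i2_s/tl2 kernel problem.
subprocess.run([
    LLAMA_CLI,
    "-m", MODEL,
    "-p", "Once upon a time, there was a girl who",
    "-n", "128",
    "-t", "18",
    "--temp", "0.7",
    "--repeat-penalty", "1.1",
], check=True)

If the loop persists at a nonzero temperature, that points back at the quantized weights.
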
(bitnet-cpp) C:\Users\ahouz\Desktop\aahouzi\BitNet>python run_inference.py -m models\Llama3-8B-1.58-100B-tokens\ggml-model-tl2.gguf -p "Once upon a time, there was a girl who" -n 128 -temp 0 -t 18
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3947 (406a5036) with Clang 17.0.3 for x64
main: llama backend init
............................................
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 3.33 GiB (3.56 BPW)
llm_load_print_meta: general.name     = Llama3-8B-1.58-100B-tokens
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =  3405.69 MiB
............................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 32
llama_new_context_with_model: n_ubatch   = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =    16.16 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 18

system_info: n_threads = 18 (n_threads_batch = 18) / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

sampler seed: 4294967295
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 1

Once upon a time, there was a girl who was very beautiful. She was so beautiful that she was called the most beautiful girl in the world. She was called the most beautiful girl in the world because she was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She

llama_perf_sampler_print:    sampling time =      12.62 ms /   139 runs   (    0.09 ms per token, 11016.01 tokens per second)
llama_perf_context_print:        load time =    1245.30 ms
llama_perf_context_print: prompt eval time =     646.28 ms /    11 tokens (   58.75 ms per token,    17.02 tokens per second)
llama_perf_context_print:        eval time =    7543.01 ms /   127 runs   (   59.39 ms per token,    16.84 tokens per second)
llama_perf_context_print:       total time =    8228.56 ms /   138 tokens
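
For comparison, the per-token figures in the two perf dumps are just eval time divided by eval runs; checking the numbers copied verbatim from the transcripts:

# Decode throughput from the two llama_perf_context_print dumps above.
runs = 127  # eval runs reported in both transcripts
for name, eval_ms in [("i2_s", 6035.57), ("tl2", 7543.01)]:
    print(f"{name}: {eval_ms / runs:.2f} ms/token, "
          f"{runs / eval_ms * 1000:.2f} tokens/second")
# i2_s: 47.52 ms/token, 21.04 tokens/second
# tl2: 59.39 ms/token, 16.84 tokens/second

So on this machine tl2 decodes roughly 20% slower than i2_s while producing the identical degenerate completion.
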

CPU

Intel Core Ultra 7 155H

OS

Windows 11
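
Since both transcripts end in the same loop, a cheap automatic check when trying other prompts or quantizations is whether the completion's final n-gram already occurred earlier in the text; a minimal self-contained sketch (the function name and window size are illustrative, not part of BitNet):

def repeats_tail_ngram(text: str, n: int = 8) -> bool:
    """True if the last n words already appeared earlier in the text,
    a cheap signal for degenerate loops like the ones in this issue."""
    words = text.split()
    if len(words) <= n:
        return False
    tail = tuple(words[-n:])
    earlier = {tuple(words[i:i + n]) for i in range(len(words) - n)}
    return tail in earlier

# The looping completion from the transcripts above trips the check:
completion = ("She was so beautiful that she was called the most "
              "beautiful girl in the world. " * 5)
print(repeats_tail_ngram(completion))  # True
print(repeats_tail_ngram("Once upon a time, there was a girl who"))  # False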