microsoft / T-MAC

Low-bit LLM inference on CPU with lookup table
MIT License

Cannot run compile.py on my Apple M1Max. #25

Closed begoss closed 1 week ago

begoss commented 3 weeks ago

I built T-MAC and TVM on my Apple M1 Max successfully, then ran python tools/run_pipeline.py -o /Users/huhao/Desktop/Project/LLM/Models/bitnet_b1_58-3B/bitnet_b1_58-3B, and it failed with this error:

[17:37:19] /Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 14.0.6 with `-mcpu=apple-m2` is not valid in `-mtriple=arm64-apple-darwin23.1.0`, using default `-mcpu=generic`
[17:37:19] /Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 14.0.6 with `-mcpu=apple-m2` is not valid in `-mtriple=arm64-apple-darwin23.1.0`, using default `-mcpu=generic`

[Task qgemm_lut_t4_int8_m6400_k8640_n1_b2]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/4) | 0.00 s
[17:37:25] /Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 14.0.6 with `-mcpu=apple-m2` is not valid in `-mtriple=arm64-apple-darwin23.1.0`, using default `-mcpu=generic`
[the same LLVM error is printed three more times]
warning: ptr type is only supported in -opaque-pointers mode
define i32 @tbl_int8_reset(i32 noundef %0, ptr nocapture noundef writeonly %1) local_unnamed_addr #0 {
                                           ^
[the same warning is printed three more times, with the caret markers interleaved]

[Task qgemm_lut_t4_int8_m6400_k8640_n1_b2]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (4/4) | 6.24 s
WARNING:root:Could not find any valid schedule for task Task(func_name=qgemm_lut_t4_int8_m6400_k8640_n1_b2, args=(6400, 1, 8640), kwargs={}, workload=('qgemm_lut_t4_int8_m6400_k8640_n1_b2', 6400, 1, 8640)). A file containing the errors has been written to /var/folders/bp/lv2qvml94f1fz9tzrtv0snkc0000gn/T/tvm_tuning_errors_l3ri_x4a.log.
[17:37:26] /Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 14.0.6 with `-mcpu=apple-m2` is not valid in `-mtriple=arm64-apple-darwin23.1.0`, using default `-mcpu=generic`
[the same LLVM error is repeated 15 more times]
WARNING:autotvm:Cannot find config for target=llvm -keys=arm_cpu,cpu -mcpu=apple-m2 -mtriple=arm64-apple-darwin23.1.0, workload=('qgemm_lut_t4_int8_m6400_k8640_n1_b2', 6400, 1, 8640). A fallback configuration is used, which may bring great performance regression.
[17:37:26] /Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 14.0.6 with `-mcpu=apple-m2` is not valid in `-mtriple=arm64-apple-darwin23.1.0`, using default `-mcpu=generic`
warning: ptr type is only supported in -opaque-pointers mode
define i32 @tbl_int8_reset(i32 noundef %0, ptr nocapture noundef writeonly %1) local_unnamed_addr #0 {
                                           ^
 Done.
Traceback (most recent call last):
  File "compile.py", line 240, in <module>
    main()
  File "compile.py", line 230, in main
    compile(**device_kwargs)
  File "compile.py", line 126, in compile
    qgemm_mod = qgemm_lut.compile(
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/python/t_mac/ops/base.py", line 268, in compile
    func = tvm.build(s, tensors, name=template_name)
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/python/tvm/driver/build_module.py", line 297, in build
    rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/src/target/llvm/llvm_instance.cc", line 176
error: expected type
define i32 @tbl_int8_reset(i32 noundef %0, ptr nocapture noundef writeonly %1) local_unnamed_addr #0 {
                                           ^

Here are my clang and LLVM versions:

Apple clang version 15.0.0 (clang-1500.0.40.1)
Target: arm64-apple-darwin23.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

LLVM (http://llvm.org/):
  LLVM version 14.0.6
  Optimized build.
  Default target: arm64-apple-darwin23.4.0
  Host CPU: cyclone
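
(For reference, the -mcpu values a given LLVM build accepts can be listed by asking llc directly. A quick sketch, assuming the llc binary from the LLVM build that TVM links against is on PATH:)

import subprocess

# List the CPUs this llc knows for the Apple arm64 triple; LLVM 14
# does not list apple-m2, which is why it falls back to -mcpu=generic,
# while the clang+llvm 17.0.6 bundle does list it.
result = subprocess.run(
    ["llc", "-mtriple=arm64-apple-darwin23.1.0", "-mcpu=help"],
    capture_output=True, text=True, input="",
)
for line in (result.stdout + result.stderr).splitlines():
    if "apple" in line:
        print(line.strip())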

I changed -mcpu to generic, and then got an error like this:

Traceback (most recent call last):
  File "compile.py", line 240, in <module>
    main()
  File "compile.py", line 230, in main
    compile(**device_kwargs)
  File "compile.py", line 126, in compile
    qgemm_mod = qgemm_lut.compile(
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/python/t_mac/ops/base.py", line 255, in compile
    self.tuning(*args, n_trial=n_trial, thread_affinity=thread_affinity, **eval_kwargs)
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/python/t_mac/ops/base.py", line 95, in tuning
    task = autotvm.task.create(template_name, args=args, target=self.target)
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/python/tvm/autotvm/task/task.py", line 480, in create
    sch, _ = ret.func(*args)
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/python/tvm/autotvm/task/task.py", line 240, in __call__
    return self.fcustomized(*args, **kwargs)
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/python/t_mac/ops/base.py", line 72, in _func
    sch = self._schedule(tensors)
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/python/t_mac/ops/qgemm.py", line 233, in _schedule
    intrin, ll_code, header_code, body_code = tbl(
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/python/t_mac/intrins/tbl.py", line 166, in tbl
    ll_code, header_code, body_code = _create_llvm("tbl.cc", body_code, cc, cc_opts)
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/python/t_mac/intrins/utils.py", line 23, in _create_llvm
    ll_code = clang.create_llvm(
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/python/tvm/contrib/clang.py", line 107, in create_llvm
    raise RuntimeError(msg)
RuntimeError: Compilation error:
/var/folders/bp/lv2qvml94f1fz9tzrtv0snkc0000gn/T/tmpuee4p_dg/input0.cc:354:42: error: always_inline function 'vcvtq_f16_s16' requires target feature 'fullfp16', but would be inlined into function 'tbl_g4_int8_float_update_impl' that is compiled without support for 'fullfp16'
            float16x8_t vec_v_bot_low  = vcvtq_f16_s16(adder_bot.get_low());
                                         ^
/var/folders/bp/lv2qvml94f1fz9tzrtv0snkc0000gn/T/tmpuee4p_dg/input0.cc:355:42: error: always_inline function 'vcvtq_f16_s16' requires target feature 'fullfp16', but would be inlined into function 'tbl_g4_int8_float_update_impl' that is compiled without support for 'fullfp16'
            float16x8_t vec_v_bot_high = vcvtq_f16_s16(adder_bot.get_high());
                                         ^
/var/folders/bp/lv2qvml94f1fz9tzrtv0snkc0000gn/T/tmpuee4p_dg/input0.cc:356:42: error: always_inline function 'vcvtq_f16_s16' requires target feature 'fullfp16', but would be inlined into function 'tbl_g4_int8_float_update_impl' that is compiled without support for 'fullfp16'
            float16x8_t vec_v_top_low  = vcvtq_f16_s16(adder_top.get_low());
                                         ^
/var/folders/bp/lv2qvml94f1fz9tzrtv0snkc0000gn/T/tmpuee4p_dg/input0.cc:357:42: error: always_inline function 'vcvtq_f16_s16' requires target feature 'fullfp16', but would be inlined into function 'tbl_g4_int8_float_update_impl' that is compiled without support for 'fullfp16'
            float16x8_t vec_v_top_high = vcvtq_f16_s16(adder_top.get_high());
                                         ^
4 errors generated.

How can I solve this so that it runs successfully on my M1?
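
(For context, vcvtq_f16_s16 is an ARM NEON intrinsic that clang only accepts when the target enables the fullfp16 extension, which -mcpu=generic does not. Below is a minimal sketch of the same tvm.contrib.clang path that the traceback goes through, with illustrative flags; the kernel and options are assumptions for demonstration, not T-MAC's actual cc_opts:)

from tvm.contrib import clang

# Minimal fp16 NEON kernel: convert eight int16 lanes to float16.
# vcvtq_f16_s16 requires the fullfp16 target feature.
cc_code = r"""
#include <arm_neon.h>
extern "C" float16x8_t to_f16(int16x8_t v) {
    return vcvtq_f16_s16(v);
}
"""

# With -mcpu=generic this fails exactly like the errors above; an
# fp16-capable CPU such as apple-m1 lets it compile to LLVM IR.
ll_code = clang.create_llvm(
    cc_code,
    options=["-O2", "--target=arm64-apple-darwin", "-mcpu=apple-m1"],
)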

kaleid-liner commented 3 weeks ago

Please follow the guide. Running pip install . will download llvm+clang 17.0.6 and build TVM for you.
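
(After the install, one quick way to confirm which LLVM the installed TVM was actually built against is to query its build info; a sketch assuming a TVM build that exposes LLVM_VERSION via tvm.support.libinfo:)

import tvm

# Print the LLVM version baked into this TVM build; after the
# pip install . above it should report 17.0.6.
print(tvm.support.libinfo().get("LLVM_VERSION"))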

begoss commented 3 weeks ago

> Please follow the guide. Running pip install . will download llvm+clang 17.0.6 and build TVM for you.

Thanks for the reply. I fixed PLATFORM_LLVM_MAP and tried the pip install . command; it downloaded llvm+clang 17.0.6 and built TVM automatically. Then I ran python tools/run_pipeline.py -o /Users/huhao/Desktop/Project/LLM/Models/bitnet_b1_58-3B/bitnet_b1_58-3B, but it still stops at STEP.0:

Running STEP.0: Compile kernels
  Running command in /Users/huhao/Desktop/Project/LLM/T-MAC/deploy:
    python compile.py -o tuned -da -nt 4 -tb -gc -gs 128 -ags 64 -t -m hf-bitnet-3b -md /Users/huhao/Desktop/Project/LLM/Models/bitnet_b1_58-3B/bitnet_b1_58-3B
Please check logs/2024-08-21-14-46-35.log for what's wrong

The error log was the same:

Traceback (most recent call last):
  File "compile.py", line 240, in <module>
    main()
  File "compile.py", line 230, in main
    compile(**device_kwargs)
  File "compile.py", line 126, in compile
    qgemm_mod = qgemm_lut.compile(
  File "/opt/anaconda3/envs/tvm-build-test/lib/python3.8/site-packages/t_mac/ops/base.py", line 255, in compile
    self.tuning(*args, n_trial=n_trial, thread_affinity=thread_affinity, **eval_kwargs)
  File "/opt/anaconda3/envs/tvm-build-test/lib/python3.8/site-packages/t_mac/ops/base.py", line 95, in tuning
    task = autotvm.task.create(template_name, args=args, target=self.target)
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/python/tvm/autotvm/task/task.py", line 480, in create
    sch, _ = ret.func(*args)
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/python/tvm/autotvm/task/task.py", line 240, in __call__
    return self.fcustomized(*args, **kwargs)
  File "/opt/anaconda3/envs/tvm-build-test/lib/python3.8/site-packages/t_mac/ops/base.py", line 72, in _func
    sch = self._schedule(tensors)
  File "/opt/anaconda3/envs/tvm-build-test/lib/python3.8/site-packages/t_mac/ops/qgemm.py", line 233, in _schedule
    intrin, ll_code, header_code, body_code = tbl(
  File "/opt/anaconda3/envs/tvm-build-test/lib/python3.8/site-packages/t_mac/intrins/tbl.py", line 166, in tbl
    ll_code, header_code, body_code = _create_llvm("tbl.cc", body_code, cc, cc_opts)
  File "/opt/anaconda3/envs/tvm-build-test/lib/python3.8/site-packages/t_mac/intrins/utils.py", line 23, in _create_llvm
    ll_code = clang.create_llvm(
  File "/Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/tvm/python/tvm/contrib/clang.py", line 107, in create_llvm
    raise RuntimeError(msg)
RuntimeError: Compilation error:
/var/folders/bp/lv2qvml94f1fz9tzrtv0snkc0000gn/T/tmp0lajl8tn/input0.cc:354:42: error: always_inline function 'vcvtq_f16_s16' requires target feature 'fullfp16', but would be inlined into function 'tbl_g4_int8_float_update_impl' that is compiled without support for 'fullfp16'
            float16x8_t vec_v_bot_low  = vcvtq_f16_s16(adder_bot.get_low());
                                         ^
/var/folders/bp/lv2qvml94f1fz9tzrtv0snkc0000gn/T/tmp0lajl8tn/input0.cc:355:42: error: always_inline function 'vcvtq_f16_s16' requires target feature 'fullfp16', but would be inlined into function 'tbl_g4_int8_float_update_impl' that is compiled without support for 'fullfp16'
            float16x8_t vec_v_bot_high = vcvtq_f16_s16(adder_bot.get_high());
                                         ^
/var/folders/bp/lv2qvml94f1fz9tzrtv0snkc0000gn/T/tmp0lajl8tn/input0.cc:356:42: error: always_inline function 'vcvtq_f16_s16' requires target feature 'fullfp16', but would be inlined into function 'tbl_g4_int8_float_update_impl' that is compiled without support for 'fullfp16'
            float16x8_t vec_v_top_low  = vcvtq_f16_s16(adder_top.get_low());
                                         ^
/var/folders/bp/lv2qvml94f1fz9tzrtv0snkc0000gn/T/tmp0lajl8tn/input0.cc:357:42: error: always_inline function 'vcvtq_f16_s16' requires target feature 'fullfp16', but would be inlined into function 'tbl_g4_int8_float_update_impl' that is compiled without support for 'fullfp16'
            float16x8_t vec_v_top_high = vcvtq_f16_s16(adder_top.get_high());
                                         ^
4 errors generated.

I also tried modifying config.cmake:

set(USE_LLVM "/Users/huhao/Desktop/Project/LLM/T-MAC/build/clang+llvm-17.0.6-arm64-apple-darwin22.0/bin/llvm-config")

Then I rebuilt TVM, and it used clang+llvm-17.0.6 successfully:

...
-- LLVM links against zlib
-- Found ZLIB: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.0.sdk/usr/lib/libz.tbd (found version "1.2.12")
-- Found zstd: /opt/homebrew/lib/libzstd.dylib
-- LLVM links against static zstd
-- LLVM linker flag: -lcurses
-- LLVM links against xml2
-- Found LLVM_INCLUDE_DIRS=/Users/huhao/Desktop/Project/LLM/T-MAC/build/clang+llvm-17.0.6-arm64-apple-darwin22.0/include
-- Found LLVM_DEFINITIONS=-D__STDC_CONSTANT_MACROS;-D__STDC_FORMAT_MACROS;-D__STDC_LIMIT_MACROS
-- Found LLVM_LIBS=/Users/huhao/Desktop/Project/LLM/T-MAC/build/clang+llvm-17.0.6-arm64-apple-darwin22.0/lib/libLLVMWindowsManifest.a;/Users/huhao/Desktop/Project/LLM/T-MAC/build/clang+llvm-17.0.6-arm64-apple-darwin22.0/lib/libLLVMXRay.a;...
...
[100%] Building CXX object CMakeFiles/tvm_runtime_objs.dir/src/runtime/contrib/random/random.cc.o
[100%] Building CXX object CMakeFiles/tvm_runtime_objs.dir/src/runtime/contrib/sort/sort.cc.o
[100%] Built target tvm_runtime_objs
[100%] Linking CXX shared library libtvm_runtime.dylib
[100%] Linking CXX shared library libtvm.dylib
ld: warning: -undefined error is deprecated
ld: warning: -undefined error is deprecated
[100%] Built target tvm_runtime
[100%] Built target tvm

But I still got the same error after running run_pipeline.py.

How can I solve this? Thank you.

kaleid-liner commented 3 weeks ago

> Thanks for the reply. I fixed PLATFORM_LLVM_MAP

Can you give me more details about the reason?

> I changed -mcpu to generic, and then got an error like this:

Are you still changing this line? -mcpu=apple-m2 should work with llvm+clang 17.0.6, even on Apple M1.

https://github.com/microsoft/T-MAC/blob/d90af6ce7dfddfad626924799dc3e03dbed9eb52/python/t_mac/utils.py#L10
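
(The line linked above sits in T-MAC's platform-to--mcpu table. Below is a hypothetical sketch of the shape of such a map, to show why an M1 host ended up with -mcpu=apple-m2; the names and entries are illustrative, not the actual utils.py contents:)

# Hypothetical sketch, not the actual t_mac/utils.py contents: one
# -mcpu entry per detected platform means every Apple Silicon host
# gets the same value, so it must be one the linked LLVM accepts.
PLATFORM_MCPU_MAP = {
    "OSX": "apple-m2",       # rejected by LLVM 14, accepted by LLVM 17
    "android": "cortex-a76",
}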

kaleid-liner commented 2 weeks ago

I have pushed a fix to modify the arch map. However, can you confirm whether -mcpu=apple-m2 works for you?

begoss commented 1 week ago

> I have pushed a fix to modify the arch map. However, can you confirm whether -mcpu=apple-m2 works for you?

I used the latest code and it worked on my M1 Max, thank you!

Log start
main: build = 2854 (70c312d)
main: built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin23.4.0
main: seed  = 1725344641
[14:24:01] /Users/huhao/Desktop/Project/LLM/T-MAC/3rdparty/llama.cpp/ggml-tmac.cpp:38: ggml_tmac_init
llama_model_loader: loaded meta data with 25 key-value pairs and 288 tensors from /Users/huhao/Desktop/Project/LLM/Models/bitnet_b1_58-3B/bitnet_b1_58-3B/ggml-model.in.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bitnet
llama_model_loader: - kv   1:                               general.name str              = bitnet_b1_58-3B
llama_model_loader: - kv   2:                         bitnet.block_count u32              = 26
llama_model_loader: - kv   3:                      bitnet.context_length u32              = 2048
llama_model_loader: - kv   4:                    bitnet.embedding_length u32              = 3200
llama_model_loader: - kv   5:                 bitnet.feed_forward_length u32              = 8640
llama_model_loader: - kv   6:                bitnet.attention.head_count u32              = 32
llama_model_loader: - kv   7:             bitnet.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:                      bitnet.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   9:    bitnet.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 32
llama_model_loader: - kv  11:                          bitnet.vocab_size u32              = 32002
llama_model_loader: - kv  12:                   bitnet.rope.scaling.type str              = linear
llama_model_loader: - kv  13:                 bitnet.rope.scaling.factor f32              = 1.000000
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  24:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - type  f32:  105 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type   i2:  182 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bitnet
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 3200
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 26
llm_load_print_meta: n_rot            = 100
llm_load_print_meta: n_embd_head_k    = 100
llm_load_print_meta: n_embd_head_v    = 100
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3200
llm_load_print_meta: n_embd_v_gqa     = 3200
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8640
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = IN
llm_load_print_meta: model params     = 3.32 B
llm_load_print_meta: model size       = 965.21 MiB (2.44 BPW) 
llm_load_print_meta: general.name     = bitnet_b1_58-3B
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '</line>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/27 layers to GPU
llm_load_tensors:        CPU buffer size =   965.22 MiB
.................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   650.00 MiB
llama_new_context_with_model: KV self size  =  650.00 MiB, K (f16):  325.00 MiB, V (f16):  325.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =   144.51 MiB
llama_new_context_with_model: graph nodes  = 942
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 / 10 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 2048, n_batch = 2048, n_predict = 128, n_keep = 1

<s> Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington.Clients include the world’s largest and most influential companies across 190 countries.
Microsoft is the largest software company in the world, and one of the most valuable corporations in the world.
Microsoft has been at the forefront of innovation in technology, and their products are used by millions of people every day.
Microsoft has a strong commitment to technology, and their products are used by millions of people every day.
Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington.
Microsoft has been at the forefront of innovation in
llama_print_timings:        load time =     183.89 ms
llama_print_timings:      sample time =       2.99 ms /   128 runs   (    0.02 ms per token, 42780.75 tokens per second)
llama_print_timings: prompt eval time =     300.17 ms /    24 tokens (   12.51 ms per token,    79.95 tokens per second)
llama_print_timings:        eval time =    2572.80 ms /   127 runs   (   20.26 ms per token,    49.36 tokens per second)
llama_print_timings:       total time =    2893.09 ms /   151 tokens
Log end