google / gemma.cpp

lightweight, standalone C++ inference engine for Google's Gemma models.
Apache License 2.0

Gemma.cpp hangs on a Gemma 7B model that was fine-tuned using Hugging Face PEFT (QLoRA) #198

Open webbigdata-jp opened 4 months ago

webbigdata-jp commented 4 months ago

Hi, thanks for the interesting project!

I created a Gemma 7B based model, webbigdata/C3TR-Adapter.
This model is in Hugging Face transformers format; it is a translation-only model with custom prompt templates, fine-tuned with QLoRA.

So I converted it to a PyTorch checkpoint (.ckpt), and the result is f32_merge_model.ckpt.
I have confirmed that f32_merge_model.ckpt works.

Then I ran this command, with no error message:
python3 convert_weights.py --tokenizer tokenizer.model --weight f32_merge_model.ckpt --output_file gemma_cpp_merge.bin --model_type 7b

Then I ran this command, with no error message:
./build/compress_weights --weights util/gemma_cpp_merge.bin --model 7b-pt --compressed_weights util/gemma_cpp_merge.sbs

Then I ran gemma.cpp, with no error message:
./build/gemma --tokenizer util/tokenizer.model --compressed_weights util/gemma_cpp_merge.sbs --model 7b-pt

and entered my prompt:
[### Instruction:\nTranslate English to Japanese.\n\n### Input:\nThis is a test input.\n\n### Response:\n]

but the model does not output anything.

Is there something wrong with the procedure?

(screenshot: gemma-error hungon)

jan-wassenberg commented 4 months ago

Hi, thanks for reaching out :) There are two likely causes. One is that Gemma is trained for a turn format like "<start_of_turn>user\n", but because the command line specifies 7b-pt instead of 7b-it, run.cc skips this formatting. Or did the fine-tune indeed start from the PT model? I would think that IT is more suited to this kind of interaction.
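
As a minimal sketch (not run.cc's actual code; the helper name is made up, though the template string is the documented Gemma chat format), the IT formatting that 7b-pt skips looks roughly like this:

#include <string>

// Wrap raw user text in the Gemma instruction-tuned turn format.
// run.cc applies formatting like this only for -it models; with 7b-pt the
// prompt is tokenized exactly as typed.
std::string WrapForInstructionTuned(const std::string& user_text) {
  return "<start_of_turn>user\n" + user_text +
         "<end_of_turn>\n<start_of_turn>model\n";
}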

A second possible cause is that the finetune may generate weights with magnitude above 1.875, which may require a bit of extra work to support (setting the tensor's scaling factor). There is a check for this in compress_weights, but it is only enabled in 'debug' builds. The simplest way to test this is to build with msan or preferably asan enabled, if that is an option?
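
As a rough sketch of the kind of check meant here (this is not the actual code in compress_weights), one could scan each raw float tensor for magnitudes above the ~1.875 limit of the 8-bit SFP format before compressing:

#include <cmath>
#include <cstddef>
#include <cstdio>

// Returns true if any weight's magnitude exceeds the representable SFP range,
// in which case the tensor would need a scaling factor.
bool ExceedsSfpRange(const float* weights, size_t num, float limit = 1.875f) {
  for (size_t i = 0; i < num; ++i) {
    if (std::fabs(weights[i]) > limit) {
      std::fprintf(stderr, "weight %zu has magnitude %f > %f\n",
                   i, weights[i], limit);
      return true;
    }
  }
  return false;
}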

webbigdata-jp commented 4 months ago

Hi, thank you for your reply.

The base model is https://huggingface.co/google/gemma-7b

That is, it is a PT model.

As far as I understand, when the prompt template is a custom one, the PT model is often used instead of the IT model.

I have also created a llama.cpp (GGUF) version of this model, and it works with llama.cpp without any template problems.

I recompiled it using the following procedure; no errors occurred, but the situation is the same.

cd build/
rm -rf *
cd ..
cmake -B build -DCMAKE_CXX_FLAGS="-fsanitize=address" -DCMAKE_C_FLAGS="-fsanitize=address"
cmake --build build
./build/compress_weights  --weights util/gemma_cpp_merge.bin --model 7b-pt --compressed_weights util/gemma_cpp_merge.sbs
./build/gemma --tokenizer util/tokenizer.model --compressed_weights util/gemma_cpp_merge.sbs --model 7b-pt

logs

dev3@pop-os:~/work/unsloth/gemma_check/v2_11_v27_last/upload/gemma.cpp/gemma.cpp$ cmake -B build -DCMAKE_CXX_FLAGS="-fsanitize=address" -DCMAKE_C_FLAGS="-fsanitize=address"
-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Deprecation Warning at build/_deps/highway-src/CMakeLists.txt:25 (cmake_policy):
  The OLD behavior for policy CMP0111 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS - Success
-- Performing Test HWY_EMSCRIPTEN
-- Performing Test HWY_EMSCRIPTEN - Failed
-- Performing Test HWY_RISCV
-- Performing Test HWY_RISCV - Failed
-- Looking for sys/auxv.h
-- Looking for sys/auxv.h - found
-- Looking for asm/hwcap.h
-- Looking for asm/hwcap.h - not found
CMake Deprecation Warning at CMakeLists.txt:1 (cmake_minimum_required):
  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/dev3/work/unsloth/gemma_check/v2_11_v27_last/upload/gemma.cpp/gemma.cpp/build/_deps/highway-build/googletest-download
[ 11%] Creating directories for 'googletest'
[ 22%] Performing download step (git clone) for 'googletest'
Cloning into 'googletest-src'...
HEAD is now at 43efa0a4 Merge pull request #3617 from Bagira80:fix_3616
[ 33%] Performing update step for 'googletest'
[ 44%] No patch step for 'googletest'
[ 55%] No configure step for 'googletest'
[ 66%] No build step for 'googletest'
[ 77%] No install step for 'googletest'
[ 88%] No test step for 'googletest'
[100%] Completed 'googletest'
[100%] Built target googletest
-- Found Python: /usr/bin/python3.10 (found version "3.10.6") found components: Interpreter
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
CMake Deprecation Warning at build/_deps/sentencepiece-src/CMakeLists.txt:15 (cmake_minimum_required):
  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- VERSION: 0.2.0
-- Not Found TCMalloc: TCMALLOC_LIB-NOTFOUND
-- Using the multi-header code from /home/dev3/work/unsloth/gemma_check/v2_11_v27_last/upload/gemma.cpp/gemma.cpp/build/_deps/json-src/include/
-- Configuring done (31.5s)
-- Generating done (0.1s)
-- Build files have been written to: /home/dev3/work/unsloth/gemma_check/v2_11_v27_last/upload/gemma.cpp/gemma.cpp/build
dev3@pop-os:~/work/unsloth/gemma_check/v2_11_v27_last/upload/gemma.cpp/gemma.cpp$ cmake --build build

・・・

[100%] Built target compress_weights
dev3@pop-os:~/work/unsloth/gemma_check/v2_11_v27_last/upload/gemma.cpp/gemma.cpp$ ./build/compress_weights  --weights util/gemma_cpp_merge.bin --model 7b-pt --compressed_weights util/gemma_cpp_merge.sbs
Loading Parameters (size 3145728000): embedder_input_embedding
Loading Parameters (size 12288): final_norm_scale
Loading Parameters (layer=0, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=0, size 150994944): qkv_einsum_w
Loading Parameters (layer=0, size 603979776): gating_einsum_w
Loading Parameters (layer=0, size 301989888): linear_w
Loading Parameters (layer=0, size 12288): pre_attention_norm_scale
Loading Parameters (layer=0, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=1, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=1, size 150994944): qkv_einsum_w
Loading Parameters (layer=1, size 603979776): gating_einsum_w
Loading Parameters (layer=1, size 301989888): linear_w
Loading Parameters (layer=1, size 12288): pre_attention_norm_scale
Loading Parameters (layer=1, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=2, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=2, size 150994944): qkv_einsum_w
Loading Parameters (layer=2, size 603979776): gating_einsum_w
Loading Parameters (layer=2, size 301989888): linear_w
Loading Parameters (layer=2, size 12288): pre_attention_norm_scale
Loading Parameters (layer=2, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=3, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=3, size 150994944): qkv_einsum_w
Loading Parameters (layer=3, size 603979776): gating_einsum_w
Loading Parameters (layer=3, size 301989888): linear_w
Loading Parameters (layer=3, size 12288): pre_attention_norm_scale
Loading Parameters (layer=3, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=4, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=4, size 150994944): qkv_einsum_w
Loading Parameters (layer=4, size 603979776): gating_einsum_w
Loading Parameters (layer=4, size 301989888): linear_w
Loading Parameters (layer=4, size 12288): pre_attention_norm_scale
Loading Parameters (layer=4, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=5, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=5, size 150994944): qkv_einsum_w
Loading Parameters (layer=5, size 603979776): gating_einsum_w
Loading Parameters (layer=5, size 301989888): linear_w
Loading Parameters (layer=5, size 12288): pre_attention_norm_scale
Loading Parameters (layer=5, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=6, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=6, size 150994944): qkv_einsum_w
Loading Parameters (layer=6, size 603979776): gating_einsum_w
Loading Parameters (layer=6, size 301989888): linear_w
Loading Parameters (layer=6, size 12288): pre_attention_norm_scale
Loading Parameters (layer=6, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=7, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=7, size 150994944): qkv_einsum_w
Loading Parameters (layer=7, size 603979776): gating_einsum_w
Loading Parameters (layer=7, size 301989888): linear_w
Loading Parameters (layer=7, size 12288): pre_attention_norm_scale
Loading Parameters (layer=7, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=8, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=8, size 150994944): qkv_einsum_w
Loading Parameters (layer=8, size 603979776): gating_einsum_w
Loading Parameters (layer=8, size 301989888): linear_w
Loading Parameters (layer=8, size 12288): pre_attention_norm_scale
Loading Parameters (layer=8, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=9, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=9, size 150994944): qkv_einsum_w
Loading Parameters (layer=9, size 603979776): gating_einsum_w
Loading Parameters (layer=9, size 301989888): linear_w
Loading Parameters (layer=9, size 12288): pre_attention_norm_scale
Loading Parameters (layer=9, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=10, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=10, size 150994944): qkv_einsum_w
Loading Parameters (layer=10, size 603979776): gating_einsum_w
Loading Parameters (layer=10, size 301989888): linear_w
Loading Parameters (layer=10, size 12288): pre_attention_norm_scale
Loading Parameters (layer=10, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=11, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=11, size 150994944): qkv_einsum_w
Loading Parameters (layer=11, size 603979776): gating_einsum_w
Loading Parameters (layer=11, size 301989888): linear_w
Loading Parameters (layer=11, size 12288): pre_attention_norm_scale
Loading Parameters (layer=11, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=12, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=12, size 150994944): qkv_einsum_w
Loading Parameters (layer=12, size 603979776): gating_einsum_w
Loading Parameters (layer=12, size 301989888): linear_w
Loading Parameters (layer=12, size 12288): pre_attention_norm_scale
Loading Parameters (layer=12, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=13, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=13, size 150994944): qkv_einsum_w
Loading Parameters (layer=13, size 603979776): gating_einsum_w
Loading Parameters (layer=13, size 301989888): linear_w
Loading Parameters (layer=13, size 12288): pre_attention_norm_scale
Loading Parameters (layer=13, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=14, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=14, size 150994944): qkv_einsum_w
Loading Parameters (layer=14, size 603979776): gating_einsum_w
Loading Parameters (layer=14, size 301989888): linear_w
Loading Parameters (layer=14, size 12288): pre_attention_norm_scale
Loading Parameters (layer=14, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=15, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=15, size 150994944): qkv_einsum_w
Loading Parameters (layer=15, size 603979776): gating_einsum_w
Loading Parameters (layer=15, size 301989888): linear_w
Loading Parameters (layer=15, size 12288): pre_attention_norm_scale
Loading Parameters (layer=15, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=16, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=16, size 150994944): qkv_einsum_w
Loading Parameters (layer=16, size 603979776): gating_einsum_w
Loading Parameters (layer=16, size 301989888): linear_w
Loading Parameters (layer=16, size 12288): pre_attention_norm_scale
Loading Parameters (layer=16, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=17, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=17, size 150994944): qkv_einsum_w
Loading Parameters (layer=17, size 603979776): gating_einsum_w
Loading Parameters (layer=17, size 301989888): linear_w
Loading Parameters (layer=17, size 12288): pre_attention_norm_scale
Loading Parameters (layer=17, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=18, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=18, size 150994944): qkv_einsum_w
Loading Parameters (layer=18, size 603979776): gating_einsum_w
Loading Parameters (layer=18, size 301989888): linear_w
Loading Parameters (layer=18, size 12288): pre_attention_norm_scale
Loading Parameters (layer=18, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=19, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=19, size 150994944): qkv_einsum_w
Loading Parameters (layer=19, size 603979776): gating_einsum_w
Loading Parameters (layer=19, size 301989888): linear_w
Loading Parameters (layer=19, size 12288): pre_attention_norm_scale
Loading Parameters (layer=19, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=20, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=20, size 150994944): qkv_einsum_w
Loading Parameters (layer=20, size 603979776): gating_einsum_w
Loading Parameters (layer=20, size 301989888): linear_w
Loading Parameters (layer=20, size 12288): pre_attention_norm_scale
Loading Parameters (layer=20, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=21, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=21, size 150994944): qkv_einsum_w
Loading Parameters (layer=21, size 603979776): gating_einsum_w
Loading Parameters (layer=21, size 301989888): linear_w
Loading Parameters (layer=21, size 12288): pre_attention_norm_scale
Loading Parameters (layer=21, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=22, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=22, size 150994944): qkv_einsum_w
Loading Parameters (layer=22, size 603979776): gating_einsum_w
Loading Parameters (layer=22, size 301989888): linear_w
Loading Parameters (layer=22, size 12288): pre_attention_norm_scale
Loading Parameters (layer=22, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=23, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=23, size 150994944): qkv_einsum_w
Loading Parameters (layer=23, size 603979776): gating_einsum_w
Loading Parameters (layer=23, size 301989888): linear_w
Loading Parameters (layer=23, size 12288): pre_attention_norm_scale
Loading Parameters (layer=23, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=24, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=24, size 150994944): qkv_einsum_w
Loading Parameters (layer=24, size 603979776): gating_einsum_w
Loading Parameters (layer=24, size 301989888): linear_w
Loading Parameters (layer=24, size 12288): pre_attention_norm_scale
Loading Parameters (layer=24, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=25, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=25, size 150994944): qkv_einsum_w
Loading Parameters (layer=25, size 603979776): gating_einsum_w
Loading Parameters (layer=25, size 301989888): linear_w
Loading Parameters (layer=25, size 12288): pre_attention_norm_scale
Loading Parameters (layer=25, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=26, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=26, size 150994944): qkv_einsum_w
Loading Parameters (layer=26, size 603979776): gating_einsum_w
Loading Parameters (layer=26, size 301989888): linear_w
Loading Parameters (layer=26, size 12288): pre_attention_norm_scale
Loading Parameters (layer=26, size 12288): pre_ffw_norm_scale
Loading Parameters (layer=27, size 50331648): attn_vec_einsum_w
Loading Parameters (layer=27, size 150994944): qkv_einsum_w
Loading Parameters (layer=27, size 603979776): gating_einsum_w
Loading Parameters (layer=27, size 301989888): linear_w
Loading Parameters (layer=27, size 12288): pre_attention_norm_scale
Loading Parameters (layer=27, size 12288): pre_ffw_norm_scale
Regenerating c_embedding (786M), please wait
Compress 13986.7 MB/s
Regenerating c_final_norm (0M), please wait
Compress 179.8 MB/s
Regenerating pre_ff_ns_0 (0M), please wait
Compress 140.3 MB/s
Regenerating gating_ein_0 (150M), please wait
Compress 3437.7 MB/s
Regenerating linear_w_0 (75M), please wait
Compress 3459.8 MB/s
Regenerating qkv_ein_0 (37M), please wait
Compress 3426.1 MB/s
Regenerating att_ein_0 (12M), please wait
Compress 3388.5 MB/s
Regenerating pre_att_ns_0 (0M), please wait
Compress 135.2 MB/s
Regenerating pre_ff_ns_1 (0M), please wait
Compress 144.8 MB/s
Regenerating gating_ein_1 (150M), please wait
Compress 3410.8 MB/s
Regenerating linear_w_1 (75M), please wait
Compress 3342.5 MB/s
Regenerating qkv_ein_1 (37M), please wait
Compress 3357.8 MB/s
Regenerating att_ein_1 (12M), please wait
Compress 3432.3 MB/s
Regenerating pre_att_ns_1 (0M), please wait
Compress 276.3 MB/s
Regenerating pre_ff_ns_2 (0M), please wait
Compress 260.1 MB/s
Regenerating gating_ein_2 (150M), please wait
Compress 3126.9 MB/s
Regenerating linear_w_2 (75M), please wait
Compress 3125.8 MB/s
Regenerating qkv_ein_2 (37M), please wait
Compress 3123.7 MB/s
Regenerating att_ein_2 (12M), please wait
Compress 2984.7 MB/s
Regenerating pre_att_ns_2 (0M), please wait
Compress 202.9 MB/s
Regenerating pre_ff_ns_3 (0M), please wait
Compress 229.5 MB/s
Regenerating gating_ein_3 (150M), please wait
Compress 3125.8 MB/s
Regenerating linear_w_3 (75M), please wait
Compress 3127.7 MB/s
Regenerating qkv_ein_3 (37M), please wait
Compress 3124.8 MB/s
Regenerating att_ein_3 (12M), please wait
Compress 3117.9 MB/s
Regenerating pre_att_ns_3 (0M), please wait
Compress 194.0 MB/s
Regenerating pre_ff_ns_4 (0M), please wait
Compress 220.3 MB/s
Regenerating gating_ein_4 (150M), please wait
Compress 3127.1 MB/s
Regenerating linear_w_4 (75M), please wait
Compress 3126.5 MB/s
Regenerating qkv_ein_4 (37M), please wait
Compress 3430.4 MB/s
Regenerating att_ein_4 (12M), please wait
Compress 3451.5 MB/s
Regenerating pre_att_ns_4 (0M), please wait
Compress 159.4 MB/s
Regenerating pre_ff_ns_5 (0M), please wait
Compress 121.6 MB/s
Regenerating gating_ein_5 (150M), please wait
Compress 3449.6 MB/s
Regenerating linear_w_5 (75M), please wait
Compress 3458.6 MB/s
Regenerating qkv_ein_5 (37M), please wait
Compress 3393.3 MB/s
Regenerating att_ein_5 (12M), please wait
Compress 3450.0 MB/s
Regenerating pre_att_ns_5 (0M), please wait
Compress 678.4 MB/s
Regenerating pre_ff_ns_6 (0M), please wait
Compress 664.8 MB/s
Regenerating gating_ein_6 (150M), please wait
Compress 3079.9 MB/s
Regenerating linear_w_6 (75M), please wait
Compress 3128.9 MB/s
Regenerating qkv_ein_6 (37M), please wait
Compress 3125.7 MB/s
Regenerating att_ein_6 (12M), please wait
Compress 3119.1 MB/s
Regenerating pre_att_ns_6 (0M), please wait
Compress 285.9 MB/s
Regenerating pre_ff_ns_7 (0M), please wait
Compress 788.7 MB/s
Regenerating gating_ein_7 (150M), please wait
Compress 3129.1 MB/s
Regenerating linear_w_7 (75M), please wait
Compress 3130.6 MB/s
Regenerating qkv_ein_7 (37M), please wait
Compress 3127.0 MB/s
Regenerating att_ein_7 (12M), please wait
Compress 3120.5 MB/s
Regenerating pre_att_ns_7 (0M), please wait
Compress 887.0 MB/s
Regenerating pre_ff_ns_8 (0M), please wait
Compress 286.7 MB/s
Regenerating gating_ein_8 (150M), please wait
Compress 3129.5 MB/s
Regenerating linear_w_8 (75M), please wait
Compress 3129.3 MB/s
Regenerating qkv_ein_8 (37M), please wait
Compress 3097.7 MB/s
Regenerating att_ein_8 (12M), please wait
Compress 3118.4 MB/s
Regenerating pre_att_ns_8 (0M), please wait
Compress 235.6 MB/s
Regenerating pre_ff_ns_9 (0M), please wait
Compress 689.4 MB/s
Regenerating gating_ein_9 (150M), please wait
Compress 3537.5 MB/s
Regenerating linear_w_9 (75M), please wait
Compress 3194.4 MB/s
Regenerating qkv_ein_9 (37M), please wait
Compress 3396.8 MB/s
Regenerating att_ein_9 (12M), please wait
Compress 3405.4 MB/s
Regenerating pre_att_ns_9 (0M), please wait
Compress 203.8 MB/s
Regenerating pre_ff_ns_10 (0M), please wait
Compress 130.8 MB/s
Regenerating gating_ein_10 (150M), please wait
Compress 3427.2 MB/s
Regenerating linear_w_10 (75M), please wait
Compress 3415.8 MB/s
Regenerating qkv_ein_10 (37M), please wait
Compress 3437.7 MB/s
Regenerating att_ein_10 (12M), please wait
Compress 3404.7 MB/s
Regenerating pre_att_ns_10 (0M), please wait
Compress 351.1 MB/s
Regenerating pre_ff_ns_11 (0M), please wait
Compress 385.5 MB/s
Regenerating gating_ein_11 (150M), please wait
Compress 3420.4 MB/s
Regenerating linear_w_11 (75M), please wait
Compress 3571.7 MB/s
Regenerating qkv_ein_11 (37M), please wait
Compress 3507.2 MB/s
Regenerating att_ein_11 (12M), please wait
Compress 3112.2 MB/s
Regenerating pre_att_ns_11 (0M), please wait
Compress 340.4 MB/s
Regenerating pre_ff_ns_12 (0M), please wait
Compress 201.9 MB/s
Regenerating gating_ein_12 (150M), please wait
Compress 3127.2 MB/s
Regenerating linear_w_12 (75M), please wait
Compress 3459.5 MB/s
Regenerating qkv_ein_12 (37M), please wait
Compress 3122.3 MB/s
Regenerating att_ein_12 (12M), please wait
Compress 3111.4 MB/s
Regenerating pre_att_ns_12 (0M), please wait
Compress 231.7 MB/s
Regenerating pre_ff_ns_13 (0M), please wait
Compress 197.4 MB/s
Regenerating gating_ein_13 (150M), please wait
Compress 3497.9 MB/s
Regenerating linear_w_13 (75M), please wait
Compress 3444.4 MB/s
Regenerating qkv_ein_13 (37M), please wait
Compress 3452.2 MB/s
Regenerating att_ein_13 (12M), please wait
Compress 3444.3 MB/s
Regenerating pre_att_ns_13 (0M), please wait
Compress 173.9 MB/s
Regenerating pre_ff_ns_14 (0M), please wait
Compress 207.0 MB/s
Regenerating gating_ein_14 (150M), please wait
Compress 3445.2 MB/s
Regenerating linear_w_14 (75M), please wait
Compress 3504.2 MB/s
Regenerating qkv_ein_14 (37M), please wait
Compress 3459.7 MB/s
Regenerating att_ein_14 (12M), please wait
Compress 3368.1 MB/s
Regenerating pre_att_ns_14 (0M), please wait
Compress 284.1 MB/s
Regenerating pre_ff_ns_15 (0M), please wait
Compress 188.9 MB/s
Regenerating gating_ein_15 (150M), please wait
Compress 3448.9 MB/s
Regenerating linear_w_15 (75M), please wait
Compress 3413.1 MB/s
Regenerating qkv_ein_15 (37M), please wait
Compress 3493.0 MB/s
Regenerating att_ein_15 (12M), please wait
Compress 3483.5 MB/s
Regenerating pre_att_ns_15 (0M), please wait
Compress 199.6 MB/s
Regenerating pre_ff_ns_16 (0M), please wait
Compress 257.3 MB/s
Regenerating gating_ein_16 (150M), please wait
Compress 3356.0 MB/s
Regenerating linear_w_16 (75M), please wait
Compress 3126.0 MB/s
Regenerating qkv_ein_16 (37M), please wait
Compress 3433.0 MB/s
Regenerating att_ein_16 (12M), please wait
Compress 3548.6 MB/s
Regenerating pre_att_ns_16 (0M), please wait
Compress 186.8 MB/s
Regenerating pre_ff_ns_17 (0M), please wait
Compress 180.4 MB/s
Regenerating gating_ein_17 (150M), please wait
Compress 3460.5 MB/s
Regenerating linear_w_17 (75M), please wait
Compress 3501.1 MB/s
Regenerating qkv_ein_17 (37M), please wait
Compress 3321.4 MB/s
Regenerating att_ein_17 (12M), please wait
Compress 3486.2 MB/s
Regenerating pre_att_ns_17 (0M), please wait
Compress 241.6 MB/s
Regenerating pre_ff_ns_18 (0M), please wait
Compress 170.2 MB/s
Regenerating gating_ein_18 (150M), please wait
Compress 3487.9 MB/s
Regenerating linear_w_18 (75M), please wait
Compress 3475.3 MB/s
Regenerating qkv_ein_18 (37M), please wait
Compress 3499.6 MB/s
Regenerating att_ein_18 (12M), please wait
Compress 3470.2 MB/s
Regenerating pre_att_ns_18 (0M), please wait
Compress 318.0 MB/s
Regenerating pre_ff_ns_19 (0M), please wait
Compress 148.6 MB/s
Regenerating gating_ein_19 (150M), please wait
Compress 3481.2 MB/s
Regenerating linear_w_19 (75M), please wait
Compress 3486.3 MB/s
Regenerating qkv_ein_19 (37M), please wait
Compress 3496.0 MB/s
Regenerating att_ein_19 (12M), please wait
Compress 3107.8 MB/s
Regenerating pre_att_ns_19 (0M), please wait
Compress 778.9 MB/s
Regenerating pre_ff_ns_20 (0M), please wait
Compress 744.7 MB/s
Regenerating gating_ein_20 (150M), please wait
Compress 3129.1 MB/s
Regenerating linear_w_20 (75M), please wait
Compress 3129.2 MB/s
Regenerating qkv_ein_20 (37M), please wait
Compress 3128.8 MB/s
Regenerating att_ein_20 (12M), please wait
Compress 3120.4 MB/s
Regenerating pre_att_ns_20 (0M), please wait
Compress 365.0 MB/s
Regenerating pre_ff_ns_21 (0M), please wait
Compress 511.7 MB/s
Regenerating gating_ein_21 (150M), please wait
Compress 3127.7 MB/s
Regenerating linear_w_21 (75M), please wait
Compress 3126.1 MB/s
Regenerating qkv_ein_21 (37M), please wait
Compress 3125.0 MB/s
Regenerating att_ein_21 (12M), please wait
Compress 3116.6 MB/s
Regenerating pre_att_ns_21 (0M), please wait
Compress 165.7 MB/s
Regenerating pre_ff_ns_22 (0M), please wait
Compress 261.3 MB/s
Regenerating gating_ein_22 (150M), please wait
Compress 3126.8 MB/s
Regenerating linear_w_22 (75M), please wait
Compress 3126.7 MB/s
Regenerating qkv_ein_22 (37M), please wait
Compress 3121.3 MB/s
Regenerating att_ein_22 (12M), please wait
Compress 3114.7 MB/s
Regenerating pre_att_ns_22 (0M), please wait
Compress 292.8 MB/s
Regenerating pre_ff_ns_23 (0M), please wait
Compress 273.7 MB/s
Regenerating gating_ein_23 (150M), please wait
Compress 3126.8 MB/s
Regenerating linear_w_23 (75M), please wait
Compress 3125.2 MB/s
Regenerating qkv_ein_23 (37M), please wait
Compress 3120.4 MB/s
Regenerating att_ein_23 (12M), please wait
Compress 3112.5 MB/s
Regenerating pre_att_ns_23 (0M), please wait
Compress 513.4 MB/s
Regenerating pre_ff_ns_24 (0M), please wait
Compress 541.8 MB/s
Regenerating gating_ein_24 (150M), please wait
Compress 3130.7 MB/s
Regenerating linear_w_24 (75M), please wait
Compress 3127.4 MB/s
Regenerating qkv_ein_24 (37M), please wait
Compress 3126.7 MB/s
Regenerating att_ein_24 (12M), please wait
Compress 3108.2 MB/s
Regenerating pre_att_ns_24 (0M), please wait
Compress 200.2 MB/s
Regenerating pre_ff_ns_25 (0M), please wait
Compress 268.3 MB/s
Regenerating gating_ein_25 (150M), please wait
Compress 3537.1 MB/s
Regenerating linear_w_25 (75M), please wait
Compress 3532.5 MB/s
Regenerating qkv_ein_25 (37M), please wait
Compress 3467.1 MB/s
Regenerating att_ein_25 (12M), please wait
Compress 3281.4 MB/s
Regenerating pre_att_ns_25 (0M), please wait
Compress 412.5 MB/s
Regenerating pre_ff_ns_26 (0M), please wait
Compress 209.5 MB/s
Regenerating gating_ein_26 (150M), please wait
Compress 3486.7 MB/s
Regenerating linear_w_26 (75M), please wait
Compress 3534.6 MB/s
Regenerating qkv_ein_26 (37M), please wait
Compress 3564.1 MB/s
Regenerating att_ein_26 (12M), please wait
Compress 3506.6 MB/s
Regenerating pre_att_ns_26 (0M), please wait
Compress 319.6 MB/s
Regenerating pre_ff_ns_27 (0M), please wait
Compress 269.5 MB/s
Regenerating gating_ein_27 (150M), please wait
Compress 3413.9 MB/s
Regenerating linear_w_27 (75M), please wait
Compress 3501.0 MB/s
Regenerating qkv_ein_27 (37M), please wait
Compress 3516.9 MB/s
Regenerating att_ein_27 (12M), please wait
Compress 3361.3 MB/s
Regenerating pre_att_ns_27 (0M), please wait
Compress 198.0 MB/s

Thanks!

jan-wassenberg commented 4 months ago

Thank you for sharing. So PT is indeed what we want, and your model is trained for the prompt format being used. It also sounds like there is no range overrun in the compressed data.

@KumarGitesh2024 , can you help debug this? Perhaps we can insert printfs to understand which function in gemma.cc is the one that freezes?

KumarGitesh2024 commented 4 months ago

Hi @webbigdata-jp,

Can you add cout prints to the functions and share the logs, so we can see which function in gemma.cc is causing the issue or failing to respond?
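
For example, something as simple as the following at the top of each suspect function (Prefill, Attention, FFW, Transformer, ...) would do; the helper is illustrative, not existing gemma.cpp code:

#include <iostream>

// Print on entry; std::endl flushes, so the last message printed before a
// hang identifies the function that never returns.
void TraceEnter(const char* fn) {
  std::cout << "Entering " << fn << std::endl;
}

Called as TraceEnter("Attention"); on the first line of each function.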

webbigdata-jp commented 4 months ago

Hi @KumarGitesh2024

cout output:

[ Reading prompt ] Entering GenerateGemma
Entering GenerateImpl
Entering RangeChecks
Entering Prefill
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
................Entering Prefill
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
..............Entering Transformer
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Transformer
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Transformer
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Transformer
......

In my local time, from 21:40 to 22:00, the same log kept repeating. Thanks!

jan-wassenberg commented 4 months ago

Thanks for sharing the log. Looks like 28x Attention/FFW (one per layer). If we just end up calling Transformer without end, then it seems like if (token == EOS_ID) { is never hit. Does your instruction tuning include that token? Note that you can also return false from stream_token to stop generating.
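
For illustration, a rough sketch of a stream_token callback that stops generation itself rather than waiting for EOS (the callback signature is an assumption based on how run.cc uses it, and the token cap is hypothetical):

#include <cstddef>
#include <functional>

// Assumed shape: the generation loop calls back with each token and its
// probability, and stops as soon as the callback returns false.
using StreamFunc = std::function<bool(int token, float prob)>;

StreamFunc MakeStopEarlyCallback(int eos_id, size_t max_tokens) {
  size_t generated = 0;
  return [=](int token, float /*prob*/) mutable -> bool {
    ++generated;
    // Returning false stops generation even if EOS is never produced.
    if (token == eos_id || generated >= max_tokens) return false;
    // ... otherwise decode and print the token here ...
    return true;
  };
}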

webbigdata-jp commented 4 months ago

I found that not only is EOT not being output, but neither Japanese nor English text is being output.

google original 2b-it-sfp.sbs

looks good.

./build/gemma --tokenizer util/tokenizer.spm --compressed_weights util/2b-it-sfp.sbs --model 2b-it

log

> こんにちは

[ Reading prompt ] Entering GenerateGemma
Entering GenerateImpl
Entering RangeChecks
Entering Prefill
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
..........Entering Transformer
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW

こんにちはEntering Transformer
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
!Entering Transformer
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
日本語Entering Transformer
・・・

my model

looks bad.

./build/gemma --tokenizer util/tokenizer.spm --compressed_weights util/gemma_cpp_merge.sbs --model 7b-pt
> ### Instruction:\nTranslate Japanese to English.\n\n### Input:\nこんにちは\n\n### Response:\n               

[ Reading prompt ] Entering GenerateGemma
Entering GenerateImpl
Entering RangeChecks
Entering Prefill
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
................Entering Prefill
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
..........Entering Transformer
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW

Entering Transformer
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Transformer
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Attention
Entering FFW
Entering Transformer
・・・

My model outputs EOT without any problems before it is converted to the gemma.cpp format.

jan-wassenberg commented 4 months ago

Interesting. It sounds like our fp8 compression might be causing the trouble. Would you like to try building with -DGEMMA_WEIGHT_T=hwy::bfloat16_t, then rerunning compress_weights? This should produce a .sbs file about twice as large. If that fixes the problem, then it's likely a numerical issue. If not, we might have some other bug.

webbigdata-jp commented 3 months ago

Hello. Unfortunately, there was no particular difference with the 16-bit version. Could you please confirm that I'm doing it right just to be sure?

cd build/
rm -rf *
cmake -B build -DGEMMA_WEIGHT_T=hwy::bfloat16_t
cd ..
make

# Convert my original model to a bf16 PyTorch file (f16_merge_model.ckpt)

# Convert to gemma file
python3 convert_weights.py --tokenizer tokenizer.model --weight f16_merge_model.ckpt --output_file gemma_cpp_merge16.bin --model_type 7b

# Convert to sbs format
./build/compress_weights --weights util/gemma_cpp_merge16.bin --model 7b-pt --compressed_weights util/gemma_cpp_merge16.sbs

./build/gemma --tokenizer util/tokenizer.model --compressed_weights util/gemma_cpp_merge16.sbs --model 7b-pt

ls -lrth util/*sbs

-rwx------ 1 dev3 dev3 2.9G Apr  6 06:37 util/2b-it-sfp.sbs
-rw-r--r-- 1 dev3 dev3 8.7G May 27 22:39 util/gemma_cpp_merge.sbs
-rw-r--r-- 1 dev3 dev3  16G Jun  7 11:53 util/gemma_cpp_merge16.sbs

I didn't see any error or warning messages during the conversion process.

jan-wassenberg commented 3 months ago

Thank you for sharing the command line. Unfortunately I see that our documentation was incorrect, sorry about that :( We had added a GEMMA_ prefix to the C++ macro name, and CMake has a different name for this: WEIGHT_TYPE. The effect of this is that your experiment would still have used the 8-bit SFP. Would you mind retrying with -DWEIGHT_TYPE=hwy::bfloat16_t?

Meanwhile, this weight typedef has been troublesome enough that we are now looking into compiling for all weight types, so that no more typedef is required.
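
To illustrate the distinction (a simplified sketch, not a verbatim copy of the headers): the CMake option WEIGHT_TYPE is what ends up defining the C++ macro GEMMA_WEIGHT_T, which selects the compressed-weight element type, so passing the macro name directly to CMake leaves the default 8-bit SFP in place:

// cmake -DWEIGHT_TYPE=hwy::bfloat16_t effectively sets -DGEMMA_WEIGHT_T=hwy::bfloat16_t
struct SfpStream;                  // stand-in for the 8-bit SFP element type
#ifndef GEMMA_WEIGHT_T
#define GEMMA_WEIGHT_T SfpStream   // default when WEIGHT_TYPE is not set
#endif
using WeightT = GEMMA_WEIGHT_T;    // hwy::bfloat16_t roughly doubles the .sbs size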

webbigdata-jp commented 3 months ago

I recompiled it, but the situation did not change.
When I printed the token ID in the stream_token lambda in run.cc, all the token IDs were 0.

} else {
  std::cout << "\nToken ID: [" << token << "]\n";
  std::string token_text;

Token ID: [0]

Only zeros are output in both 16-bit and 8-bit settings.

jan-wassenberg commented 3 months ago

@webbigdata-jp thank you for trying. This sounds like a serious bug. We do not run compress_weights often and it recently changed, so this is possible. We are very busy this week but I have made a note to investigate, thanks for letting us know :)

webbigdata-jp commented 3 months ago

Thank you.

I'm not in a hurry, so if you find out the cause, please let me know.

If I can get gemma.cpp to work, it will open the door to running my Gemma-based models on multiple platforms without compiling.