nigelzzz opened this issue 4 months ago
Hi @nigelzzz, can you please provide more information so that we may reproduce it? For example, what version of Python are you using? Which branch are you using?
Please also provide reproduction steps, for example:
python convert_to_tflite.py
<whatever commands you used to run the model>
Thanks!
Hi @pkgoogle,
Python version: 3.9.5
ai-edge-torch branch: v0.1.1
/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/convert_to_tflite.py
python3 convert_to_tflite.py
Then tiny_llama_seq512_kv1024.tflite appears in the current path.
I built /mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/c++/text_generator_main.cc.
My modification is below for reference.
// Prepare helpers
std::unique_ptr<tflite::FlatBufferModel> LoadModel() {
std::unique_ptr<tflite::FlatBufferModel> model =
@@ -85,7 +93,13 @@ std::unique_ptr<tflite::Interpreter> BuildInterpreter(
tflite::ops::builtin::BuiltinOpResolver resolver;
// NOTE: We need to manually register optimized OPs for KV-cache and
// Scaled Dot Product Attention (SDPA).
- tflite::ops::custom::GenAIOpsRegisterer(&resolver);
+ resolver.AddCustom("odml.update_kv_cache",
+ tflite::ops::custom::Register_KV_CACHE());
+ resolver.AddCustom("odml.scaled_dot_product_attention",
+ tflite::ops::custom::Register_SDPA());
+
+
+ //tflite::ops::custom::GenAIOpsRegisterer(&resolver);
@pkgoogle, by the way, I have a small question: can I know the source of /ai-edge-torch/tree/main/ai_edge_torch/generative/examples/tiny_llama/tiny_llama_lm_logits.pt? I can't see the file on the llama huggingface repo.
The .pt file is used as a golden test set for our development; it is not available on HF. @talumbau can confirm as well.
@haozha111 thanks a lot!!!
Hi @nigelzzz, which checkpoint data are you using from the original tiny_llama model? Thanks for your help.
@pkgoogle, that's my checkpoint: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/tree/main
@pkgoogle, hi, can you reproduce it, or do you have any suggestions for debugging it? I can help solve it.
Thanks!!
Hi @nigelzzz, @hheydary is currently assigned to this case. I would first check whether you still get the same result after removing your modifications. If not, then you know it has something to do with your update. If so, you said "can show", so are you saying this happens often or just once in a while? If it happens only in particular instances, that will be good data to share with us. If it happens "all the time", this should show up in the loss when validating on a known dataset. Those would be good places to start. Hope that helps.
Hi @nigelzzz, instruction-tuned models (and language models in general) are trained to recognize specialized tokens and take actions when they see those tokens. First, I noticed that you are not including the BOS and EOS tokens when running the model. Those tokens for the model you mentioned can be found here. Additionally, for best results, you need to manually add the "chat template" that was used to train the model to your input prompt. From the model's page on HF, the template looks like this:
# <|user|>
# How many helicopters can a human eat in one sitting?</s>
# <|assistant|>
# ...
i.e., <|user|> \n PROMPT \n <|assistant|>.
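For illustration, here is a minimal Python sketch of building such a prompt. It assumes the TinyLlama-1.1B-Chat-v1.0 template shown above and the usual Llama `<s>`/`</s>` special tokens; `build_prompt` is only an illustrative helper, not part of the repo:

```python
# Sketch only: wrap a user message in the chat template quoted above and
# leave the assistant turn open for generation.
def build_prompt(user_message: str) -> str:
    return f"<|user|>\n{user_message}</s>\n<|assistant|>\n"


prompt = build_prompt("How many helicopters can a human eat in one sitting?")
print(prompt)
# If tokenizing manually, prepend the BOS id and stop generation at the EOS id,
# e.g. with sentencepiece: ids = [sp.bos_id()] + sp.encode(prompt)
```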
Hi @hheydary and @pkgoogle, my output still shows garbled characters. Can I use https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py to test text generation?
Prompt:
<|user|>
Write an email:
<|assistant|>
Output text:
agyagyagyagyagyagyagyagyagyagyagyagyagyagyagyagyagyścingtonścścścścingtonścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścirościrościrościrościrościrościrościrościrościrościrościrościroiroirościrościroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroirooczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczocz
Unfortunately, I am not able to reproduce the issue that you are seeing. Using the following command:
bazel run -c opt //ai_edge_torch/generative/examples/c++:text_generator_main -- --tflite_model=model.tflite --sentencepiece_model=tokenizer.model --prompt="<|user|> \n Write and email:\n <|assistant|>" --start_token="<s>" --stop_token="</s>" --num_threads=16
The model generates reasonable outputs.
A few things:
@hheydary, thanks for your response!!
Are you using TinyLlama to test it?
If I run the script https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py and the block below passes, can I assume the model I converted is fine?
assert torch.allclose(
tiny_llama_goldens, lm_logits[0, idx.shape[1] - 1, :], atol=1e-05
)
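For reference, that check can be read as a small standalone helper. This is a sketch only: the real script inlines the assertion, and `lm_logits` and `idx` would come from running the re-authored PyTorch TinyLlama in tiny_llama.py.

```python
# Sketch: what the golden-logits assertion verifies. The .pt file is the
# golden logits shipped with the example; `lm_logits` and `idx` come from
# running the re-authored PyTorch model.
import torch


def matches_goldens(lm_logits: torch.Tensor, idx: torch.Tensor,
                    goldens_path: str = "tiny_llama_lm_logits.pt") -> bool:
    goldens = torch.load(goldens_path)
    # Compare the logits produced at the last prompt token against the goldens.
    return torch.allclose(goldens, lm_logits[0, idx.shape[1] - 1, :], atol=1e-05)
```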
Which TensorFlow libraries does text_generator_main link against (libtensorflow.so or libtensorflowlite.so)? My target machine is not Android; it is Yocto Linux, e.g., RPi 4/5.
Do you have any suggestions on how to configure the build without the Android flag?
Or can you share your TinyLlama model (tflite format)?
Which version did you use (v0.2.0)?
@hheydary,
when I use 0.2.0 and run python3 tiny_llama.py, the output shows the following:
git branch
* (HEAD detached at origin/release/0.2.0)
2024-08-07 11:09:48.229016: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1723028988.241253 364737 cuda_dnn.cc:8439] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1723028988.245210 364737 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-07 11:09:48.254251: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-07 11:09:48.938564: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py:153: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
tiny_llama_goldens = torch.load(current_dir / "tiny_llama_lm_logits.pt")
Traceback (most recent call last):
File "/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py", line 168, in <module>
define_and_run()
File "/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py", line 162, in define_and_run
assert torch.allclose(
AssertionError
/user/: CC=/usr/bin/clang-18 bazel run -c opt //ai_edge_torch/generative/examples/c++:text_generator_main -- --tflite_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/ttiny_llama_seq512_kv1024.tflite --sentencepiece_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/TinyLlama-1.1B-Chat-v1.0/tokenizer.model --prompt="<|user|> \n Write and email:\n <|assistant|>" --start_token="<s>" --stop_token="</s>" --num_threads=1
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
DEBUG: /mnt/data/nigel_wang/tensorflow_cache/153a550227f3ff2fa4e4811633058a05/external/org_tensorflow/third_party/repo.bzl:132:14:
Warning: skipping import of repository 'com_google_absl' because it already exists.
DEBUG: /mnt/data/nigel_wang/tensorflow_cache/153a550227f3ff2fa4e4811633058a05/external/org_tensorflow/third_party/repo.bzl:132:14:
Warning: skipping import of repository 'XNNPACK' because it already exists.
INFO: Analyzed target //ai_edge_torch/generative/examples/c++:text_generator_main (147 packages loaded, 3826 targets configured).
INFO: From Compiling src/google/protobuf/generated_message_tctable_lite.cc [for tool]:
external/protobuf~/src/google/protobuf/generated_message_tctable_lite.cc:347:14: warning: unused function 'Offset' [-Wunused-function]
347 | inline void* Offset(void* base, uint32_t offset) {
| ^~~~~~
1 warning generated.
INFO: From Compiling src/google/protobuf/compiler/cpp/helpers.cc [for tool]:
external/protobuf~/src/google/protobuf/compiler/cpp/helpers.cc:197:25: warning: unused function 'VerifyInt32TypeToVerifyCustom' [-Wunused-function]
197 | inline VerifySimpleType VerifyInt32TypeToVerifyCustom(VerifyInt32Type t) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
INFO: From Executing genrule @@org_tensorflow//tensorflow/lite/acceleration/configuration:configuration_schema:
When you use --proto, that you should check for conformity yourself, using the existing --conform
INFO: Found 1 target...
Target //ai_edge_torch/generative/examples/c++:text_generator_main up-to-date:
bazel-bin/ai_edge_torch/generative/examples/c++/text_generator_main
INFO: Elapsed time: 276.290s, Critical Path: 109.56s
INFO: 1493 processes: 601 internal, 892 linux-sandbox.
INFO: Build completed successfully, 1493 total actions
INFO: Running command line: bazel-bin/ai_edge_torch/generative/examples/c++/text_generator_main '--tflite_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/ttiny_llama_seq512_kv1024.tflite' '--sentencepiece_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/TinyLlama-1.1B-Chat-v1.0/tokenizer.model' '--prompt=<|user|> \n Write and email:\n <|assistant|>' '--start_token=<s>' '--stop_token=</s>' '--num_threads=1'
ERROR: Didn't find op for builtin opcode 'STABLEHLO_COMPOSITE' version '1'. An older version of this builtin might be supported. Are you using an old TFLite binary with a newer model?
ERROR: Registration failed.
Error at ai_edge_torch/generative/examples/c++/text_generator_main.cc:93
- Above this error: I see that a newer TFLite version adds `stablehlo_composite`:
  https://github.com/tensorflow/tensorflow/commit/f4f2393888af78879dc9b299786023fe87fbbcfc
- The TensorFlow commit pinned in WORKSPACE does not add it (one way to check which opcodes the exported model actually uses is sketched after this list):
  - _TENSORFLOW_GIT_COMMIT = "26d4ea90364daa14bbb2bc5c2aa68f5b70c4641f"
  - https://github.com/tensorflow/tensorflow/blob/26d4ea90364daa14bbb2bc5c2aa68f5b70c4641f/tensorflow/lite/core/kernels/register.cc#L385
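One way to confirm which opcodes the exported flatbuffer actually contains (and therefore what the pinned TFLite runtime must support) is the TFLite model analyzer. A sketch, assuming a reasonably recent tensorflow pip package:

```python
# Sketch: dump the operators in the exported model without needing the
# odml custom ops to be registered first.
import tensorflow as tf

tf.lite.experimental.Analyzer.analyze(
    model_path="tiny_llama_seq512_kv1024.tflite"
)
# The printed report lists each subgraph's ops; STABLEHLO_COMPOSITE and the
# odml.* custom ops should show up there if the converter emitted them.
```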
In 0.2.0:
CC=/usr/bin/clang-18 bazel run -c opt //ai_edge_torch/generative/examples/c++:text_generator_main -- --tflite_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/ttiny_llama_seq512_kv1024.tflite --sentencepiece_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/TinyLlama-1.1B-Chat-v1.0/tokenizer.model --prompt="<|user|> \n Write and email:\n <|assistant|>" --start_token="<s>" --stop_token="</s>" --num_threads=1
INFO: Running command line: bazel-bin/ai_edge_torch/generative/examples/c++/text_generator_main '--tflite_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/ttiny_llama_seq512_kv1024.tflite' '--sentencepiece_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/TinyLlama-1.1B-Chat-v1.0/tokenizer.model' '--prompt=<|user|> \n Write and email:\n <|assistant|>' '--start_token=<s>' '--stop_token=</s>' '--num_threads=1'
normalizer.cc(52) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Prompt:
<|user|> \n Write and email:\n <|assistant|>
Output text:
@hheydary, I think I found a useful data point:
- `quantize: bool = True`: decoding succeeds.
- `quantize: bool = False`: decoding fails, e.g., in the log above the output is all ??
def convert_tiny_llama_to_tflite(
checkpoint_path: str,
prefill_seq_len: int = 512,
kv_cache_max_len: int = 1024,
quantize: bool = True,
):
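Based on that signature, both cases should be reproducible by calling the example's conversion entry point directly. A sketch only: the module path follows the repo layout, and the checkpoint path is a placeholder for a local TinyLlama-1.1B-Chat-v1.0 directory.

```python
# Sketch: export both a quantized and a float model with the example's
# conversion function. Adjust CHECKPOINT to your local checkpoint directory.
from ai_edge_torch.generative.examples.tiny_llama import convert_to_tflite

CHECKPOINT = "/path/to/TinyLlama-1.1B-Chat-v1.0"  # placeholder

# Quantized export (reported above to decode correctly).
convert_to_tflite.convert_tiny_llama_to_tflite(CHECKPOINT, quantize=True)

# Float export (reported above to produce garbled output).
convert_to_tflite.convert_tiny_llama_to_tflite(CHECKPOINT, quantize=False)
```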
@pkgoogle @hheydary @haozha111, I think I found a useful data point. Can you reproduce it on your side?
Description of the bug:
With a *.tflite model (no quantization), tiny_llama_seq512_kv1024.tflite, the output is