nigelzzz opened this issue 4 months ago
Hi @nigelzzz, can you please provide more information so that we may reproduce it? For example, what version of Python are you using? Which branch are you using?
Please also provide reproduction steps, for example:
python convert_to_tflite.py
<whatever commands you used to run the model>
Thanks!
Hi @pkgoogle,
Python version: 3.9.5
ai-edge-torch branch: v0.1.1
/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/convert_to_tflite.py
python3 convert_to_tflite.py
Then tiny_llama_seq512_kv1024.tflite appears in the current path.
I built /mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/c++/text_generator_main.cc.
My modification is below for reference.
// Prepare helpers
std::unique_ptr<tflite::FlatBufferModel> LoadModel() {
std::unique_ptr<tflite::FlatBufferModel> model =
@@ -85,7 +93,13 @@ std::unique_ptr<tflite::Interpreter> BuildInterpreter(
tflite::ops::builtin::BuiltinOpResolver resolver;
// NOTE: We need to manually register optimized OPs for KV-cache and
// Scaled Dot Product Attention (SDPA).
- tflite::ops::custom::GenAIOpsRegisterer(&resolver);
+ resolver.AddCustom("odml.update_kv_cache",
+ tflite::ops::custom::Register_KV_CACHE());
+ resolver.AddCustom("odml.scaled_dot_product_attention",
+ tflite::ops::custom::Register_SDPA());
+
+
+ //tflite::ops::custom::GenAIOpsRegisterer(&resolver);
@pkgoogle, by the way, I have a small question: can I know the source of /ai-edge-torch/tree/main/ai_edge_torch/generative/examples/tiny_llama/tiny_llama_lm_logits.pt? I can't see the file on the llama huggingface repo.
The .pt file is used as a golden test set for our development; it is not available on HF. @talumbau can confirm as well.
@haozha111 thanks a lot!!!
Hi @nigelzzz, which checkpoint data are you using from the original tiny_llama model? Thanks for your help.
@pkgoogle, that's my checkpoint: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/tree/main
@pkgoogle, hi, can you reproduce it, or do you have any suggestions for debugging it? I can help solve it.
Thanks!!
Hi @nigelzzz, @hheydary is currently assigned to this case. I would first check whether you still get the same result after removing your modifications. If not, then you know it has something to do with your update. If so, you said "can show", so are you saying this happens often or just once in a while? If it happens only in particular instances, that will be good data to share with us. If it happens "all the time", this should show up in the loss when validating on a known dataset. Those would be good places to start. Hope that helps.
Hi @nigelzzz, instruction-tuned models (and language models in general) are trained to recognize specialized tokens and take actions when they see those tokens. First, I noticed that you are not including the BOS and EOS tokens when running the model. Those tokens for the model you mentioned can be found here. Additionally, for best results, you need to manually add the "chat template" that was used to train the model to your input prompt. From the model's page on HF, the template looks like this:
# <|user|>
# How many helicopters can a human eat in one sitting?</s>
# <|assistant|>
# ...
i.e., <|user|> \n PROMPT \n <|assistant|>.
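For illustration, here is a minimal Python sketch of building such a prompt. It assumes the TinyLlama-1.1B-Chat-v1.0 template shown above and the usual Llama `<s>`/`</s>` special tokens; `build_prompt` is only an illustrative helper, not part of the repo:

```python
# Sketch only: wrap a user message in the chat template quoted above and
# leave the assistant turn open for generation.
def build_prompt(user_message: str) -> str:
    return f"<|user|>\n{user_message}</s>\n<|assistant|>\n"


prompt = build_prompt("How many helicopters can a human eat in one sitting?")
print(prompt)
# If tokenizing manually, prepend the BOS id and stop generation at the EOS id,
# e.g. with sentencepiece: ids = [sp.bos_id()] + sp.encode(prompt)
```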
Hi @hheydary and @pkgoogle, my output still shows garbled characters. Can I use https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py to test text generation?
Prompt:
<|user|>
Write an email:
<|assistant|>
Output text:
agyagyagyagyagyagyagyagyagyagyagyagyagyagyagyagyagyścingtonścścścścingtonścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścścirościrościrościrościrościrościrościrościrościrościrościrościroiroirościrościroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroiroirooczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczoczocz
Unfortunately, I am not able to reproduce the issue that you are seeing. Using the following command:
bazel run -c opt //ai_edge_torch/generative/examples/c++:text_generator_main -- --tflite_model=model.tflite --sentencepiece_model=tokenizer.model --prompt="<|user|> \n Write and email:\n <|assistant|>" --start_token="<s>" --stop_token="</s>" --num_threads=16
The model generates reasonable outputs.
A few things:
@hheydary, thanks for your response!!
Are you using TinyLlama to test it?
If I run the script https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py and the block below passes, can I assume the model I converted is fine?
assert torch.allclose(
tiny_llama_goldens, lm_logits[0, idx.shape[1] - 1, :], atol=1e-05
)
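For reference, that check can be read as a small standalone helper. This is a sketch only: the real script inlines the assertion, and `lm_logits` and `idx` would come from running the re-authored PyTorch TinyLlama in tiny_llama.py.

```python
# Sketch: what the golden-logits assertion verifies. The .pt file is the
# golden logits shipped with the example; `lm_logits` and `idx` come from
# running the re-authored PyTorch model.
import torch


def matches_goldens(lm_logits: torch.Tensor, idx: torch.Tensor,
                    goldens_path: str = "tiny_llama_lm_logits.pt") -> bool:
    goldens = torch.load(goldens_path)
    # Compare the logits produced at the last prompt token against the goldens.
    return torch.allclose(goldens, lm_logits[0, idx.shape[1] - 1, :], atol=1e-05)
```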
Which TensorFlow libraries does text_generator_main link against (libtensorflow.so or libtensorflowlite.so)? My target machine is not Android; it is Yocto Linux, e.g., RPi 4/5.
Do you have any suggestions on how to configure the build without the Android flag?
Or can you share your TinyLlama model (tflite format)?
Which version did you use (v0.2.0)?
@hheydary,
when I use 0.2.0 and run python3 tiny_llama.py, the output shows the following:
git branch
* (HEAD detached at origin/release/0.2.0)
2024-08-07 11:09:48.229016: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1723028988.241253 364737 cuda_dnn.cc:8439] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1723028988.245210 364737 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-07 11:09:48.254251: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-07 11:09:48.938564: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py:153: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
tiny_llama_goldens = torch.load(current_dir / "tiny_llama_lm_logits.pt")
Traceback (most recent call last):
File "/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py", line 168, in <module>
define_and_run()
File "/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py", line 162, in define_and_run
assert torch.allclose(
AssertionError
/user/: CC=/usr/bin/clang-18 bazel run -c opt //ai_edge_torch/generative/examples/c++:text_generator_main -- --tflite_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/ttiny_llama_seq512_kv1024.tflite --sentencepiece_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/TinyLlama-1.1B-Chat-v1.0/tokenizer.model --prompt="<|user|> \n Write and email:\n <|assistant|>" --start_token="<s>" --stop_token="</s>" --num_threads=1
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
DEBUG: /mnt/data/nigel_wang/tensorflow_cache/153a550227f3ff2fa4e4811633058a05/external/org_tensorflow/third_party/repo.bzl:132:14:
Warning: skipping import of repository 'com_google_absl' because it already exists.
DEBUG: /mnt/data/nigel_wang/tensorflow_cache/153a550227f3ff2fa4e4811633058a05/external/org_tensorflow/third_party/repo.bzl:132:14:
Warning: skipping import of repository 'XNNPACK' because it already exists.
INFO: Analyzed target //ai_edge_torch/generative/examples/c++:text_generator_main (147 packages loaded, 3826 targets configured).
INFO: From Compiling src/google/protobuf/generated_message_tctable_lite.cc [for tool]:
external/protobuf~/src/google/protobuf/generated_message_tctable_lite.cc:347:14: warning: unused function 'Offset' [-Wunused-function]
347 | inline void* Offset(void* base, uint32_t offset) {
| ^~~~~~
1 warning generated.
INFO: From Compiling src/google/protobuf/compiler/cpp/helpers.cc [for tool]:
external/protobuf~/src/google/protobuf/compiler/cpp/helpers.cc:197:25: warning: unused function 'VerifyInt32TypeToVerifyCustom' [-Wunused-function]
197 | inline VerifySimpleType VerifyInt32TypeToVerifyCustom(VerifyInt32Type t) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
INFO: From Executing genrule @@org_tensorflow//tensorflow/lite/acceleration/configuration:configuration_schema:
When you use --proto, that you should check for conformity yourself, using the existing --conform
INFO: Found 1 target...
Target //ai_edge_torch/generative/examples/c++:text_generator_main up-to-date:
bazel-bin/ai_edge_torch/generative/examples/c++/text_generator_main
INFO: Elapsed time: 276.290s, Critical Path: 109.56s
INFO: 1493 processes: 601 internal, 892 linux-sandbox.
INFO: Build completed successfully, 1493 total actions
INFO: Running command line: bazel-bin/ai_edge_torch/generative/examples/c++/text_generator_main '--tflite_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/ttiny_llama_seq512_kv1024.tflite' '--sentencepiece_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/TinyLlama-1.1B-Chat-v1.0/tokenizer.model' '--prompt=<|user|> \n Write and email:\n <|assistant|>' '--start_token=<s>' '--stop_token=</s>' '--num_threads=1'
ERROR: Didn't find op for builtin opcode 'STABLEHLO_COMPOSITE' version '1'. An older version of this builtin might be supported. Are you using an old TFLite binary with a newer model?
ERROR: Registration failed.
Error at ai_edge_torch/generative/examples/c++/text_generator_main.cc:93
- Above this error: I see that a newer TFLite version adds `stablehlo_composite`:
  https://github.com/tensorflow/tensorflow/commit/f4f2393888af78879dc9b299786023fe87fbbcfc
- The TensorFlow commit pinned in WORKSPACE does not add it (one way to check which opcodes the exported model actually uses is sketched after this list):
  - _TENSORFLOW_GIT_COMMIT = "26d4ea90364daa14bbb2bc5c2aa68f5b70c4641f"
  - https://github.com/tensorflow/tensorflow/blob/26d4ea90364daa14bbb2bc5c2aa68f5b70c4641f/tensorflow/lite/core/kernels/register.cc#L385
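One way to confirm which opcodes the exported flatbuffer actually contains (and therefore what the pinned TFLite runtime must support) is the TFLite model analyzer. A sketch, assuming a reasonably recent tensorflow pip package:

```python
# Sketch: dump the operators in the exported model without needing the
# odml custom ops to be registered first.
import tensorflow as tf

tf.lite.experimental.Analyzer.analyze(
    model_path="tiny_llama_seq512_kv1024.tflite"
)
# The printed report lists each subgraph's ops; STABLEHLO_COMPOSITE and the
# odml.* custom ops should show up there if the converter emitted them.
```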
In 0.2.0:
CC=/usr/bin/clang-18 bazel run -c opt //ai_edge_torch/generative/examples/c++:text_generator_main -- --tflite_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/ttiny_llama_seq512_kv1024.tflite --sentencepiece_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/TinyLlama-1.1B-Chat-v1.0/tokenizer.model --prompt="<|user|> \n Write and email:\n <|assistant|>" --start_token="<s>" --stop_token="</s>" --num_threads=1
INFO: Running command line: bazel-bin/ai_edge_torch/generative/examples/c++/text_generator_main '--tflite_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/ttiny_llama_seq512_kv1024.tflite' '--sentencepiece_model=/mnt/data/nigel_wang/ai-edge-torch/ai_edge_torch/generative/examples/tiny_llama/TinyLlama-1.1B-Chat-v1.0/tokenizer.model' '--prompt=<|user|> \n Write and email:\n <|assistant|>' '--start_token=<s>' '--stop_token=</s>' '--num_threads=1'
normalizer.cc(52) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Prompt:
<|user|> \n Write and email:\n <|assistant|>
Output text:
@hheydary, I think I found a useful data point:
- `quantize: bool = True`: decoding succeeds.
- `quantize: bool = False`: decoding fails, e.g., in the log above the output is all ??
def convert_tiny_llama_to_tflite(
checkpoint_path: str,
prefill_seq_len: int = 512,
kv_cache_max_len: int = 1024,
quantize: bool = True,
):
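Based on that signature, both cases should be reproducible by calling the example's conversion entry point directly. A sketch only: the module path follows the repo layout, and the checkpoint path is a placeholder for a local TinyLlama-1.1B-Chat-v1.0 directory.

```python
# Sketch: export both a quantized and a float model with the example's
# conversion function. Adjust CHECKPOINT to your local checkpoint directory.
from ai_edge_torch.generative.examples.tiny_llama import convert_to_tflite

CHECKPOINT = "/path/to/TinyLlama-1.1B-Chat-v1.0"  # placeholder

# Quantized export (reported above to decode correctly).
convert_to_tflite.convert_tiny_llama_to_tflite(CHECKPOINT, quantize=True)

# Float export (reported above to produce garbled output).
convert_to_tflite.convert_tiny_llama_to_tflite(CHECKPOINT, quantize=False)
```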
@pkgoogle @hheydary @haozha111, I think I found a useful data point. Can you reproduce it on your side?
Description of the bug:
With a *.tflite model (no quantization), tiny_llama_seq512_kv1024.tflite, the output is