Open FranzKafkaYu opened 2 months ago
Related issue: https://github.com/google-ai-edge/mediapipe/issues/5570
From the MediaPipe official website: it says we can use AI Edge Torch to convert Gemma2-2b to a suitable format, but there are no further details:
It would be good if the MediaPipe Python conversion tool could support this conversion.
Thanks to all of you developers.
Hi @FranzKafkaYu, I am currently looking into running Gemma 2 on AI Edge. Would it be possible to verify the source of the referenced image?
Came across a similar source, which includes a guide for running with tflite, and am now validating the reproducibility of that setup.
(Source: https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference#gemma-2_2b)
Thanks in advance.
Other related issues: #5594
It seems that the issue was raised based on the following link: https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android#model
The method for converting using AI Edge Torch is detailed in the guidelines provided in the above link. Unfortunately, it seems that for now, the .tflite conversion must be done manually.
Based on this, it seems the conversion process would be as follows: Downloading the .ckpt file via Kaggle -> Converting to .tflite using AI Edge Torch -> Implementing Android inference with the .tflite file using MediaPipe.
```mermaid
graph TD
    A[Kaggle .ckpt file] --> B[AI Edge Torch .tflite conversion]
    B --> C[MediaPipe Android inference]
```
P.S. It seems there might be a typo in the Android guide: "AI Edge Troch" should be corrected to "AI Edge Torch" on the website.
I think the documentation should mention https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/gemma/convert_gemma2_to_tflite.py
> Hi @FranzKafkaYu, I am currently looking into running Gemma 2 on AI Edge. Would it be possible to verify the source of the referenced image?
> Came across a similar source, which includes a guide for running with tflite, and am now validating the reproducibility of that setup.
> (Source: https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference#gemma-2_2b)
> Thanks in advance.
> Other related issues: #5594
@KennethanCeyer Hi Ken, you can find more details via this link. If you want to use a model with MediaPipe Solutions/Framework, you need to convert the model: safetensors/pytorch format -> tflite format -> MediaPipe format.
Currently, if you use Gemma (not Gemma2), there are suitably formatted models on Kaggle; you can check this link. Gemma2 doesn't have them yet.
MediaPipe provides a Python library for converting safetensors/pytorch format -> MediaPipe format with two different methods (details here), but this library doesn't currently support Gemma2 in native model conversion. So the only choice is AI Edge model conversion, which requires using the AI Edge Torch tool first to get the TFLite format and then the MediaPipe Python library to bundle the model, as sketched below.
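For the bundling step, a rough sketch using the MediaPipe Python bundler would look like the following; the file names and token strings here are placeholders that I have not verified for Gemma2:

```python
# Rough sketch: bundle a TFLite model produced by AI Edge Torch into a MediaPipe .task file.
# File names and token strings are placeholders, not values verified for Gemma2.
from mediapipe.tasks.python.genai import bundler

config = bundler.BundleConfig(
    tflite_model="gemma2_2b.tflite",      # output of the AI Edge Torch conversion
    tokenizer_model="tokenizer.model",    # SentencePiece tokenizer shipped with the checkpoint
    start_token="<bos>",
    stop_tokens=["<eos>"],
    output_filename="gemma2_2b.task",
    enable_bytes_to_unicode_mapping=False,
)
bundler.create_bundle(config)
```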
But I have checked AI Edge Torch, and it lacks details on how to complete that conversion first. The MediaPipe LLM Inference API demonstrations also give little information about how to use these "bundled" models, which end with .task; the sample code uses a native model, which ends with .bin.
I have tried other projects, like llama.cpp and gemma.cpp, and the performance is not good because they mainly use the CPU to execute inference. You can give them a try, but I think MediaPipe with a GPU backend would be better.
I am not a native English speaker, so my English is not very good. I hope this info can help you.
> I think the documentation should mention https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/gemma/convert_gemma2_to_tflite.py
GOOD, I will try this script and see whether we can go to the next step.
Hi @FranzKafkaYu, thank you for the explanation; you've done an excellent job explaining the situation.
I've actually been investigating the same issue of using Gemma 2 with LiteRT (.tflite) on MediaPipe, which is what brought me to this discussion. From all the issues, code records, and documentation I've reviewed, it seems a .tflite distribution of Gemma 2 hasn't yet been registered in the Kaggle or Hugging Face registries. (It looks like they're working hard on this and it's probably on their roadmap, but there's no official file available yet.)
Based on the most recent visible documentation, it appears we need to convert the .ckpt file to .tflite using AI Edge Torch and then use it according to each specific use case. (It seems the documentation is lacking; it doesn't look like it's been around for very long.)
The code I mentioned above seems to be the closest thing to an official guide at the moment. I'm currently working on this myself, and I'm planning to write a blog post about it when I'm done. Once it's ready, I'll make sure to share the link here in this issue for reference.
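For anyone who hasn't used AI Edge Torch before, its core convert/export flow looks roughly like the sketch below. The Gemma 2 example script layers checkpoint loading, model authoring, and quantization on top of this, so the ResNet model here is only a stand-in to illustrate the API, not the actual Gemma 2 conversion:

```python
# Illustration of the basic AI Edge Torch convert/export API (not the Gemma 2 path itself).
# convert_gemma2_to_tflite.py builds the model from a checkpoint and adds quantization
# on top of a flow like this.
import torch
import torchvision
import ai_edge_torch

# Stand-in PyTorch model; Gemma 2 would come from ai_edge_torch's generative examples.
model = torchvision.models.resnet18(weights=None).eval()
sample_inputs = (torch.randn(1, 3, 224, 224),)

# Convert to an edge model and serialize it as a .tflite flatbuffer.
edge_model = ai_edge_torch.convert(model, sample_inputs)
edge_model.export("resnet18.tflite")
```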
Thanks again for your helpful insights and creating this issue, Franz.
With quite a few questions expected around running Gemma 2 with MediaPipe, I made a Colab for the conversion, along with links to the related issues and PRs. The notebook will be continuously updated until the official tflite or MediaPipe tasks are released.
Hi @FranzKafkaYu,
Apologies for the delayed response. Support for Gemma 2-2B is now available, and ongoing discussions are happening here. Please let us know if you require any further assistance, or if we can proceed to close the issue and mark it as internally resolved, as the feature has been implemented.
Thank you!!
Hi @FranzKafkaYu, I've been encountering an issue when trying to run the script ai-edge-torch/ai_edge_torch/generative/examples/gemma/convert_gemma2_to_tflite.py or gemma2-to-tflite/convert.py. In both cases, the error happens at the line where the code tries to load a file using torch.load(file).
On Google Colab, the process dies at this_file_tensors = torch.load(file) with ^C (this ^C is not caused by pressing Ctrl+C on the keyboard; it appears automatically).
On my local machine, the same line (this_file_tensors = torch.load(file)) outputs Segmentation Fault.
I've checked my system's memory, and it's not an issue of insufficient memory. The same error occurs consistently in both environments. Any suggestions on what could be causing this segmentation fault or how to troubleshoot further would be greatly appreciated! Thanks in advance!
colab logs:
2024-09-12 08:01:43.412352: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1726128103.448539 2980 cuda_dnn.cc:8322] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1726128103.459941 2980 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-12 08:01:43.505942: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/usr/local/lib/python3.10/dist-packages/torch_xla/__init__.py:202: UserWarning: `tensorflow` can conflict with `torch-xla`. Prefer `tensorflow-cpu` when using PyTorch/XLA. To silence this warning, `pip uninstall -y tensorflow && pip install tensorflow-cpu`. If you are in a notebook environment such as Colab or Kaggle, restart your notebook runtime afterwards.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/ai_edge_torch/generative/utilities/loader.py:84: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  this_file_tensors = torch.load(file)
^C
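For reference, a minimal way to reproduce just that load step outside the script is sketched below; the checkpoint path is a placeholder, and the mmap / weights_only / map_location arguments are only guesses at reducing memory pressure, not a confirmed fix:

```python
# Minimal sketch to isolate the failing torch.load call outside the conversion script.
# The checkpoint path is a placeholder, and the keyword arguments are assumptions about
# what might reduce peak memory during deserialization, not a confirmed fix.
import torch

checkpoint_file = "path/to/gemma2-2b/model.ckpt"  # placeholder; point at the real checkpoint

tensors = torch.load(
    checkpoint_file,
    map_location="cpu",   # keep tensors on CPU while loading
    weights_only=True,    # restrict unpickling to tensors and primitive types
    mmap=True,            # memory-map the file (requires the zip-file checkpoint format)
)
print(type(tensors))
```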
Hi,
Just wanted to update this issue with the latest info. Previously (as discussed in this issue), Gemma 2 2B was only available in the LLM Inference API by going through a conversion pathway via ai_edge_torch. This was difficult for many people (especially due to the large memory requirements for conversion and quantization of the float checkpoint). So we have made the .task files of a quantized version of Gemma 2 available on Kaggle directly.
They have the extension .task. You use these files just like any other with the LLM Inference API. Essentially, these files contain the model weights as well as binary information on the tokenizer for the model. Please give that a try! Note: GPU and CPU models are available, but GPU is most likely to work on newer and high-end phones for now. Thanks for trying out the Inference API. We hope to have more info to share soon!
Tiny correction: the CPU model is a .task file, representing a successful conversion through ai_edge_torch, but the GPU model is a .bin file.
Hi @FranzKafkaYu,
Could you please confirm if this issue is resolved or any further assistance is needed?
Thank you!!
MediaPipe Solution (you are using)
Android library: com.google.mediapipe:tasks-genai:0.10.14
Programming language
Android Java
Are you willing to contribute it
None
Describe the feature and the current behaviour/state
Currently we have no suitable MediaPipe-format model for running Gemma2-2b on Android, and the MediaPipe Python libraries can't complete the conversion.
Will this change the current API? How?
no
Who will benefit with this feature?
all of us
Please specify the use cases for this feature
Use the latest Gemma2 model with MediaPipe.
Any Other info
No response