google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://ai.google.dev/edge/mediapipe
Apache License 2.0

Conversion of gemma-2-2b-it model to TensorFlow Lite #5570

Open shubham0204 opened 3 months ago

shubham0204 commented 3 months ago

Have I written custom code (as opposed to using a stock example script provided in MediaPipe)

None

OS Platform and Distribution

Google Colab (Linux) Ubuntu 22.04.3 LTS

MediaPipe Tasks SDK version

0.10.14

Task name (e.g. Image classification, Gesture recognition etc.)

LLM Inference

Programming Language and version (e.g. C++, Python, Java)

Python

Describe the actual behavior

The converter.convert_checkpoint method throws an AssertionError with no message.

Describe the expected behaviour

The gemma-2-2b-it model should be converted to a TFLite model (for the CPU backend).

Standalone code/steps you may have used to try to get what you need

from huggingface_hub import hf_hub_download
import os
import mediapipe as mp
from mediapipe.tasks.python.genai import converter

# Download the Gemma 2 2B instruction-tuned checkpoint (safetensors shards and
# tokenizer files) from the Hugging Face Hub.
REPO_ID = "google/gemma-2-2b-it"
FILENAMES = ["tokenizer.json", "tokenizer_config.json", "model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
os.environ['HF_TOKEN'] = "<token>"
for filename in FILENAMES:
  hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir="./gemma-2-2b-it")

# Convert the safetensors checkpoint to a TFLite model for the CPU backend.
config = converter.ConversionConfig(
    input_ckpt="/content/gemma-2-2b-it",
    ckpt_format='safetensors',
    model_type='GEMMA_2B',
    backend="cpu",
    output_dir="/content/intermediate/gemma-2-2b-it/",
    combine_file_only=False,
    vocab_model_file="/content/gemma-2-2b-it",
    output_tflite_file="/content/converted_models/gemma-2-2b-it-cpu"
)
converter.convert_checkpoint(config)

Other info / Complete Logs

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-10-ae16540c09c6> in <cell line: 14>()
     12     output_tflite_file="/content/converted_models/gemma-2-2b-it-cpu"
     13 )
---> 14 converter.convert_checkpoint(config)

3 frames
/usr/local/lib/python3.10/dist-packages/mediapipe/tasks/python/genai/converter/quantization_util.py in quantize_tensor(var, axis, factor, sym, number_bits, use_fp, add_scale_eps, optimization_on_bound, p_value, per_channel, block_size)
    352   """
    353   # TODO: support jnp.float8_e5m2
--> 354   assert number_bits == 8 or number_bits == 4 , f"Number bits {number_bits}"
    355   jnp_var = jnp.asarray(var)
    356   # When using sub-channel, the contracting dim is split into a sub-channel
kuaashish commented 3 months ago

Hi @shubham0204,

Could you please confirm whether you are using the example Colab provided here for model conversion, and whether you have reviewed the required arguments for the converter?

Thank you!!

shubham0204 commented 3 months ago

Yes @kuaashish, I am using the same notebook. Here are the additional blocks of code I added to download Gemma 2 and convert it to TFLite:

from huggingface_hub import hf_hub_download
import os

REPO_ID = "google/gemma-2-2b-it"
FILENAMES = ["tokenizer.json", "tokenizer_config.json", "model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
os.environ['HF_TOKEN'] = "<token>"
for filename in FILENAMES:
  hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir="./gemma-2-2b-it")
import mediapipe as mp
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
    input_ckpt="/content/gemma-2-2b-it",
    ckpt_format='safetensors',
    model_type='GEMMA_2B',
    backend="cpu",
    output_dir="/content/intermediate/gemma-2-2b-it/",
    combine_file_only=False,
    vocab_model_file="/content/gemma-2-2b-it",
    output_tflite_file="/content/converted_models/gemma-2-2b-it-cpu"
)
converter.convert_checkpoint(config)
Woody0414 commented 3 months ago

Adding layer_norms to the LayerType class in /site-packages/mediapipe/tasks/python/genai/converter/safetensors_converter.py gets past the assert, but the resulting output_tflite_file looks wrong because its size does not shrink.


class LayerType(enum.Enum):
  """Enum for layer type."""

  NONE = 0
  ATTENTION = 1  # Layer is part of the attention module.
  FEEDFORWARD = 2  # Layer is part of the feedforward module in the Transformer.
  EMBEDDING = 3  # Layer is the embedding lookup or final projection layer.
  LAYER_NORM = (
      4  # Layer is layer normalization before and after attention layer.
  )
  LORA = 5  # Layer is LoRA weights augmented on the base model layers.

  @classmethod
  def get_layer_type(cls, layer_name: str):
    """Gets the layer type of the given layer name."""
    ffn_layers = [
        "mlp",
    ]
    attn_layers = [
        "self_attn",
    ]
    emb_layers = [
        "embed_tokens",
        "lm_head",
    ]
    layer_norms = [
        "input_layernorm",
        "post_attention_layernorm",
        "post_feedforward_layernorm",
        "pre_feedforward_layernorm",
        "final_layernorm",
        "model.norm.weight",
    ]
    lora_layers = ["lora"]
    if any(sub_name in layer_name for sub_name in lora_layers):
      return LayerType.LORA
    if any(sub_name in layer_name for sub_name in attn_layers):
      return LayerType.ATTENTION
    if any(sub_name in layer_name for sub_name in ffn_layers):
      return LayerType.FEEDFORWARD
    if any(sub_name in layer_name for sub_name in emb_layers):
      return LayerType.EMBEDDING
    if any(sub_name in layer_name for sub_name in layer_norms):
      return LayerType.LAYER_NORM
    else:
      return LayerType.NONE
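
A quick way to sanity-check the mapping above (a minimal sketch, assuming the patched safetensors_converter.py is on the Python path; the layer names are illustrative):

# Assumes the patched module above; the layer names are illustrative examples.
from mediapipe.tasks.python.genai.converter.safetensors_converter import LayerType

print(LayerType.get_layer_type("model.layers.0.input_layernorm.weight"))   # LayerType.LAYER_NORM
print(LayerType.get_layer_type("model.layers.0.self_attn.q_proj.weight"))  # LayerType.ATTENTION
print(LayerType.get_layer_type("model.layers.0.mlp.gate_proj.weight"))     # LayerType.FEEDFORWARD
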
shubham0204 commented 3 months ago

Thanks @Woody0414. I modified the MediaPipe source file, but then received the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-913a430439f8> in <cell line: 14>()
     12     output_tflite_file="/content/converted_models/gemma-2-2b-it-cpu"
     13 )
---> 14 converter.convert_checkpoint(config)

1 frames
/usr/local/lib/python3.10/dist-packages/mediapipe/tasks/python/genai/converter/llm_converter.py in combined_weight_bins_to_tflite(model_type, backend, weight_path, output_tflite_file, vocab_model_file, lora_rank, lora_weight_path, lora_output_tflite_file)
    180     if lora_rank is not None:
    181       logging.fatal('LoRA is not supported for CPU backend.')
--> 182     model_ckpt_util.GenerateCpuTfLite(
    183         model_type,
    184         weight_path,

RuntimeError: NOT_FOUND: The path does not exist: /content/intermediate/gemma-2-2b-it/params.lm.transformer.x_layers_0.ff_layer.pre_layer_norm.scale_quantized_scale

The file params.lm.transformer.x_layers_0.ff_layer.pre_layer_norm.scale exists, but params.lm.transformer.x_layers_0.ff_layer.pre_layer_norm.scale_quantized_scale does not.
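
For reference, this is how the intermediate directory can be inspected to see which tensor files were actually written (a minimal sketch using only the standard library; the path is the output_dir from the config above):

import os

# List the per-tensor files the converter wrote for the pre_layer_norm tensor;
# the *_quantized_scale counterpart is missing.
intermediate_dir = "/content/intermediate/gemma-2-2b-it/"
for name in sorted(os.listdir(intermediate_dir)):
    if "pre_layer_norm" in name:
        print(name)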

kuaashish commented 3 months ago

Hi @shubham0204,

It appears you are trying to convert the recently released Gemma-2-2b model. Our initial testing has focused on the Gemma 2b model, and you can find more information in our documentation here. Currently, this model cannot be converted into a TFLite format, though support for this is on our roadmap. However, we cannot provide a specific timeline for availability at this moment.

Thank you!!

github-actions[bot] commented 3 months ago

This issue has been marked stale because it has had no activity in the last 7 days. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] commented 3 months ago

Are you satisfied with the resolution of your issue?

shubham0204 commented 2 months ago

@kuaashish will I get an update on this issue when support for converting Gemma2 models becomes available?

FranzKafkaYu commented 2 months ago

I have encountered the same issue here, and I followed the example here. The error log:

Traceback (most recent call last):
  File "/home/franzkafka/Desktop/mediapipe/convert.py", line 15, in <module>
    converter.convert_checkpoint(config)
  File "/home/franzkafka/.local/lib/python3.10/site-packages/mediapipe/tasks/python/genai/converter/llm_converter.py", line 323, in convert_checkpoint
    maybe_quantize_and_write_tensors_to_bins(loader, config)
  File "/home/franzkafka/.local/lib/python3.10/site-packages/mediapipe/tasks/python/genai/converter/llm_converter.py", line 284, in maybe_quantize_and_write_tensors_to_bins
    quantized_tensors = quantize_by_actions(
  File "/home/franzkafka/.local/lib/python3.10/site-packages/mediapipe/tasks/python/genai/converter/llm_converter.py", line 169, in quantize_by_actions
    target_var, scale = quantization_util.quantize_tensor(
  File "/home/franzkafka/.local/lib/python3.10/site-packages/mediapipe/tasks/python/genai/converter/quantization_util.py", line 354, in quantize_tensor
    assert number_bits == 8 or number_bits == 4
AssertionError
FranzKafkaYu commented 2 months ago

Hi @kuaashish, the MediaPipe docs say that the MediaPipe LLM Inference API already supports Gemma2, but I can't find an available Gemma2 TFLite-format model on Kaggle. How can I use the MediaPipe LLM Inference API to load Gemma2 models?

kuaashish commented 2 months ago

Hi @FranzKafkaYu,

Could you please create a new issue with a detailed description of the support you need? This will help us and the community identify and address the problem effectively with a relevant issue title.

Thank you!!

FranzKafkaYu commented 2 months ago

> Hi @shubham0204,
>
> It appears you are trying to convert the recently released Gemma-2-2b model. Our initial testing has focused on the Gemma 2b model, and you can find more information in our documentation here. Currently, this model cannot be converted into a TFLite format, though support for this is on our roadmap. However, we cannot provide a specific timeline for availability at this moment.
>
> Thank you!!

Issue created: https://github.com/google-ai-edge/mediapipe/issues/5610

talumbau commented 2 weeks ago

Hi,

We updated our docs to provide info on using Gemma2-2B here. When we initially supported Gemma2-2B, the only pathway to using it on-device was converting the model through ai_edge_torch. It still requires a system with a lot of memory to do the conversion+quantization, so we decided to just directly host the necessary file on Kaggle (at this URL). You can download the models through this interface:

[Screenshot of the Kaggle model download interface]

For Gemma2-2b, we support a CPU version and a GPU version. Both versions work with the LLM Inference API. The CPU version is a classic "TF Lite" file and can be used in traditional ways, as shown in an example here.

The LLM Inference API (doc link above) is a full-featured offering that you can call directly from an Android app, as shown in our samples here.
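
For completeness, a minimal download sketch assuming the kagglehub package; the model handle below is a placeholder, so take the exact handle of the CPU or GPU variant you need from the Kaggle page linked above:

import kagglehub

# Placeholder handle: replace "<variant>" (and the owner/model/framework path if
# needed) with the exact Gemma2-2B entry listed on the Kaggle page above.
MODEL_HANDLE = "google/gemma-2/tfLite/<variant>"
local_path = kagglehub.model_download(MODEL_HANDLE)
print("Model files downloaded to:", local_path)

The downloaded model file can then be passed as the model path option of the LLM Inference API on-device.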

kuaashish commented 1 week ago

Hi @shubham0204,

Could you please review the above and confirm whether we can close this issue and mark it as resolved internally?

Thank you!!

github-actions[bot] commented 14 hours ago

This issue has been marked stale because it has had no activity in the last 7 days. It will be closed if no further activity occurs. Thank you.