google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://ai.google.dev/edge/mediapipe
Apache License 2.0

Issue with Gemma 7B Q4 quantization #5727

Open moon5bal opened 5 days ago

moon5bal commented 5 days ago

Have I written custom code (as opposed to using a stock example script provided in MediaPipe)

Yes

OS Platform and Distribution

Android 14

Mobile device if the issue happens on mobile device

QCOM ADP 8155

Browser and version if the issue happens on browser

com.google.mediapipe.examples.llminference

Programming Language and version

Java

MediaPipe version

0.10.18

Bazel version

No response

Solution

llmInference

Android Studio, NDK, SDK versions (if issue is related to building in Android environment)

Android Studio Koala | 2024.1.1

Xcode & Tulsi version (if issue is related to building for iOS)

No response

Describe the actual behavior

The Q4-quantized Gemma 7B model sometimes generates <bos> infinitely instead of a normal response.

Describe the expected behaviour

The model generates normal sentences.

Standalone code/steps you may have used to try to get what you need

Only a Q8-quantized version of the Gemma 1.1 7B model has been released, so I performed Q4 quantization myself using the MediaPipe converter.
I confirmed that the converter exposes three quantization-related options:

    attention_quant_bits
    feedforward_quant_bits
    embedding_quant_bits

When I set embedding_quant_bits to 4, the model could not generate a normal sentence.
However, when I set feedforward_quant_bits and attention_quant_bits to 4, it generates normal sentences but sometimes emits <bos> infinitely.
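
For clarity, here is a minimal sketch of the two combinations I tried, expressed as keyword arguments for converter.ConversionConfig (the options not called out in each case are assumed to stay at 8 bits; the full script is posted in a comment below):

    # Combination A: embedding weights at 4 bits -- could not generate a normal
    # sentence (attention/feedforward assumed left at 8 bits in this run).
    quant_a = dict(attention_quant_bits=8, feedforward_quant_bits=8, embedding_quant_bits=4)

    # Combination B: attention + feedforward at 4 bits, embeddings kept at 8 bits --
    # normal sentences, but sometimes emits <bos> infinitely.
    quant_b = dict(attention_quant_bits=4, feedforward_quant_bits=4, embedding_quant_bits=8)

    # Either dict is passed through to converter.ConversionConfig(..., **quant_b),
    # together with the path/model settings shown in the full script below.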

Below is my environment information for the MediaPipe converter:

    mediapipe version: 0.10.18
    used model: gemma-7b-it
    ckpt_format: safetensors

Is there any related issue?
Do you have plans to release the Gemma 7B model with Q4 quantization?

Other info / Complete Logs

No response

schmidt-sebastian commented 3 days ago

@moon5bal Can you share the code you are using for the conversion? I will try to reproduce this on our end.

kuaashish commented 3 days ago

Hi @moon5bal,

Could you please review the above and provide the required information?

Thank you!!

moon5bal commented 3 days ago

Hi @schmidt-sebastian @kuaashish, here is the code for the conversion.

Thank you.

    import os
    import mediapipe as mp
    from mediapipe.tasks.python.genai import converter

    # Paths and conversion settings for gemma-7b-it.
    project_root = "/home/worker"
    checkpoint_path = f"{project_root}/model/gemma-7b-it"
    vocab_model_file = f"{project_root}/model/gemma-7b-it/tokenizer.model"
    output_path = f"{checkpoint_path}/tmp"
    ckpt_format = 'safetensors'
    model_type = 'GEMMA_7B'
    backend = 'gpu'
    output_tflite_file = f'{project_root}/conv_out/{model_type}_IT_Q4_feedforward4.bin'

    # Attention and feedforward weights at 4 bits; embeddings kept at 8 bits.
    config = converter.ConversionConfig(
        input_ckpt=checkpoint_path,
        ckpt_format=ckpt_format,
        model_type=model_type,
        attention_quant_bits=4,
        feedforward_quant_bits=4,
        embedding_quant_bits=8,
        backend=backend,
        output_dir=output_path,
        combine_file_only=False,
        vocab_model_file=vocab_model_file,
        output_tflite_file=output_tflite_file,
    )

    converter.convert_checkpoint(config)
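
As an extra diagnostic for the runaway <bos> output (my own idea, not a step from the MediaPipe docs), one can check that the tokenizer.model passed as vocab_model_file reports sensible special-token ids, for example with the sentencepiece package:

    import sentencepiece as spm

    # Diagnostic sketch: load the same vocab model the converter consumed and
    # print its special-token ids. For Gemma these should map to <bos>/<eos>;
    # a mismatch here would point at the vocab file rather than the quantization.
    sp = spm.SentencePieceProcessor(
        model_file="/home/worker/model/gemma-7b-it/tokenizer.model")
    print("bos:", sp.bos_id(), sp.id_to_piece(sp.bos_id()))
    print("eos:", sp.eos_id(), sp.id_to_piece(sp.eos_id()))
    print("vocab size:", sp.get_piece_size())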

moon5bal commented 3 days ago

And I'm using the safetensors checkpoint below.

https://www.kaggle.com/models/google/gemma/transformers/1.1-7b-it

kuaashish commented 3 days ago

Hi @moon5bal,

Could you please follow this Colab example to convert the Gemma 7B model with Q4 quantization, and let us know whether you are still facing the issue?

Thank you!!

moon5bal commented 3 days ago

Hi @kuaashish, thank you for your response. I created the conversion script in my comment above by referring to the Colab you provided. Similarly, I used the MediaPipe GenAI converter; the additional step I took was to include the quantization options. I haven't tried it in Colab yet, but I will modify my setup and try it there.