microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

I could not load the tokenizer with the self-built model (Gemma2, Phi3.5) #957

Closed: escon1004 closed this issue 1 month ago

escon1004 commented 1 month ago

Hello everyone,

I'm excited to be using ONNX Runtime GenAI. It's an amazing library for anyone looking to run models on their device. I've been learning how to use ONNX Runtime GenAI by following various tutorials:

https://onnxruntime.ai/docs/genai/tutorials/phi2-python.html

I've tried building two models, Gemma2-2B and Phi3.5-Mini-Instruct:

# without extra_options
python -m onnxruntime_genai.models.builder -m google/gemma-2-2b -e cpu -p int4 -o ./model_gemma
python -m onnxruntime_genai.models.builder -m microsoft/Phi-3.5-mini-instruct -e cpu -p int4 -o ./model

# with extra_options
python -m onnxruntime_genai.models.builder -i hf_path -m google/gemma-2-2b -e cpu -p int4 -o ./model_gemma --extra_options int4_block_size=128 int4_accuracy_level=4

Both seem to work quite well.

Valid precision + execution provider combinations are: FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML
Extra options: {}
/opt/conda/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py:991: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
config.json: 100%|█████████████████████████| 3.45k/3.45k [00:00<00:00, 33.7MB/s]
configuration_phi3.py: 100%|███████████████| 11.2k/11.2k [00:00<00:00, 64.6MB/s]
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-mini-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
GroupQueryAttention (GQA) is used in this model.
/opt/conda/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py:471: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
modeling_phi3.py: 100%|████████████████████| 73.8k/73.8k [00:00<00:00, 8.52MB/s]
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-mini-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
2024-10-06 09:43:08,107 transformers_modules.microsoft.Phi-3.5-mini-instruct.af0dfb8029e8a74545d0736d30cb6b58d2f0f3f0.modeling_phi3 [WARNING] - `flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
2024-10-06 09:43:08,108 transformers_modules.microsoft.Phi-3.5-mini-instruct.af0dfb8029e8a74545d0736d30cb6b58d2f0f3f0.modeling_phi3 [WARNING] - Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
model.safetensors.index.json: 100%|████████| 16.3k/16.3k [00:00<00:00, 92.4MB/s]
Downloading shards:   0%|                                 | 0/2 [00:00<?, ?it/s]
model-00001-of-00002.safetensors:   0%|             | 0.00/4.97G [00:00<?, ?B/s]
...(omitted)...
generation_config.json: 100%|██████████████████| 195/195 [00:00<00:00, 2.02MB/s]
Reading embedding layer
... (omitted) ...
Reading decoder layer 31
Reading final norm
Reading LM head
Saving ONNX model in ./model
2024-10-06 09:46:41,851 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - start to quantize /model/layers.0/attn/qkv_proj/MatMul ...
... (omitted) ...
2024-10-06 09:47:09,984 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - complete quantization of /lm_head/MatMul ...
/opt/conda/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:985: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
Saving GenAI config in ./model
/opt/conda/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py:796: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
tokenizer_config.json: 100%|███████████████| 3.98k/3.98k [00:00<00:00, 47.7MB/s]
tokenizer.model: 100%|████████████████████████| 500k/500k [00:00<00:00, 305MB/s]
tokenizer.json: 100%|██████████████████████| 1.84M/1.84M [00:00<00:00, 6.12MB/s]
added_tokens.json: 100%|███████████████████████| 306/306 [00:00<00:00, 3.71MB/s]
special_tokens_map.json: 100%|█████████████████| 665/665 [00:00<00:00, 7.95MB/s]
Saving processing files in ./model for GenAI
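
As a quick sanity check (my own snippet, not part of the tutorial), the output directory can be listed to confirm the exported files are all present:

import os

# Per the build log above, the directory should contain model.onnx,
# genai_config.json, and the tokenizer files (tokenizer.json,
# tokenizer_config.json, tokenizer.model, added_tokens.json, ...)
for name in sorted(os.listdir("./model")):
    print(name)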

The model builds successfully, and loading it does not raise any issues:

import onnxruntime_genai as og

# Prompt taken from the phi2-python tutorial linked above
prompt = '''def print_prime(n):
    """
    Print all primes between 1 and n
    """'''

# Load the exported Gemma2 model from the builder output directory
model = og.Model('./model_gemma')

However, when I tried to load the tokenizer for the model, an error occurred:

tokenizer = og.Tokenizer(model)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[5], line 1
----> 1 tokenizer = og.Tokenizer(model)

RuntimeError: [json.exception.type_error.302] type must be string, but is array


Here are my generated config files (genai_config.json):

genai_config.json of gemma2-2b

{
    "model": {
        "bos_token_id": 2,
        "context_length": 8192,
        "decoder": {
            "session_options": {
                "log_id": "onnxruntime-genai",
                "provider_options": []
            },
            "filename": "model.onnx",
            "head_size": 256,
            "hidden_size": 2304,
            "inputs": {
                "input_ids": "input_ids",
                "attention_mask": "attention_mask",
                "past_key_names": "past_key_values.%d.key",
                "past_value_names": "past_key_values.%d.value"
            },
            "outputs": {
                "logits": "logits",
                "present_key_names": "present.%d.key",
                "present_value_names": "present.%d.value"
            },
            "num_attention_heads": 8,
            "num_hidden_layers": 26,
            "num_key_value_heads": 4
        },
        "eos_token_id": 1,
        "pad_token_id": 0,
        "type": "gemma2",
        "vocab_size": 256000
    },
    "search": {
        "diversity_penalty": 0.0,
        "do_sample": false,
        "early_stopping": true,
        "length_penalty": 1.0,
        "max_length": 8192,
        "min_length": 0,
        "no_repeat_ngram_size": 0,
        "num_beams": 1,
        "num_return_sequences": 1,
        "past_present_share_buffer": true,
        "repetition_penalty": 1.0,
        "temperature": 1.0,
        "top_k": 1,
        "top_p": 1.0
    }
}

genai_config.json of phi3.5-mini-instruct

{
    "model": {
        "bos_token_id": 1,
        "context_length": 131072,
        "decoder": {
            "session_options": {
                "log_id": "onnxruntime-genai",
                "provider_options": []
            },
            "filename": "model.onnx",
            "head_size": 96,
            "hidden_size": 3072,
            "inputs": {
                "input_ids": "input_ids",
                "attention_mask": "attention_mask",
                "past_key_names": "past_key_values.%d.key",
                "past_value_names": "past_key_values.%d.value"
            },
            "outputs": {
                "logits": "logits",
                "present_key_names": "present.%d.key",
                "present_value_names": "present.%d.value"
            },
            "num_attention_heads": 32,
            "num_hidden_layers": 32,
            "num_key_value_heads": 32
        },
        "eos_token_id": [
            32007,
            32001,
            32000
        ],
        "pad_token_id": 32000,
        "type": "phi3",
        "vocab_size": 32064
    },
    "search": {
        "diversity_penalty": 0.0,
        "do_sample": true,
        "early_stopping": true,
        "length_penalty": 1.0,
        "max_length": 131072,
        "min_length": 0,
        "no_repeat_ngram_size": 0,
        "num_beams": 1,
        "num_return_sequences": 1,
        "past_present_share_buffer": true,
        "repetition_penalty": 1.0,
        "temperature": 1.0,
        "top_k": 1,
        "top_p": 1.0
    }
}

I'm not exactly sure about the difference, but I noticed that my tokenizer file list is slightly different from that of the pre-converted model uploaded on Hugging Face.


I also noticed that my own tokenizer files are identical to the original files from Hugging Face: https://huggingface.co/microsoft/Phi-3.5-mini-instruct/tree/main and https://huggingface.co/google/gemma-2-2b/tree/main

Are there any specific requirements before building the ONNX model file? Should I convert the tokenizer format before starting?

anencore94 commented 1 month ago

I've encountered the exact same issue, tested on linux/amd64, CPU.

Just running the tutorial (no conversion, using the pre-converted model) works: https://github.com/microsoft/onnxruntime-genai?tab=readme-ov-file#sample-code-for-phi-3-in-python

But after converting in my own environment (https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/generate-e2e-example.sh), it failed with the same error message.

natke commented 1 month ago

Hi @escon1004 and @anencore94, there was a change in the transformers code that caused this incompatibility with onnxruntime-genai. It will be resolved in the next release (0.5.0), coming at the end of October. In the meantime, there are two alternative workarounds that you can employ:
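
One workaround consistent with the root cause explained further below is to pin transformers to a pre-4.45 release before exporting, so the builder writes the tokenizer files in the old schema. This is an inference from the rest of the thread, not necessarily one of the two workarounds referred to above:

# Pin transformers below v4.45 (the release that changed the tokenizer
# schema, see below) and re-run the builder so the tokenizer files are
# regenerated in a format this onnxruntime-genai release can parse
pip install "transformers<4.45"
python -m onnxruntime_genai.models.builder -m microsoft/Phi-3.5-mini-instruct -e cpu -p int4 -o ./model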

anencore94 commented 1 month ago

@natke Thanks for the reply :). Would you mind telling me the specific change introduced in v4.45.0 (the version in which the change above appeared)?

wenbingl commented 1 month ago

> @natke Thanks for the reply :). Would you mind telling me the specific change introduced in v4.45.0 (the version in which the change above appeared)?

In this PR, https://github.com/huggingface/transformers/pull/32535, they upgraded the tokenizer to the latest version, which introduced a new schema for the tokenizer merge ranks.
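
That schema change lines up with the error message above. A minimal sketch (assuming the exported model lives in ./model) to check which merges schema a given tokenizer.json uses:

import json

# Inspect the BPE merge table in the exported tokenizer.json. Older files
# store each merge as a single string ("a b"); the new schema stores each
# merge as a two-element array (["a", "b"]), which is what triggers the
# "type must be string, but is array" error in the old parser.
with open("./model/tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

merges = tok["model"]["merges"]
print(type(merges[0]).__name__)  # 'str' for the old schema, 'list' for the new
print(merges[:3])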

natke commented 1 month ago

Closing this issue. Please re-open or let us know if you experience any further issues.