microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

Unable to export Microsoft/Phi-3-small-8k-instruct ONNX model on CPU (Ubuntu 22.04.4 LTS) #908

Open Kamalrajkannan opened 3 days ago

Kamalrajkannan commented 3 days ago

Export Microsoft/Phi-3-small-8k-instruct ONNX model on CPU (Ubuntu 22.04.4 LTS)

As per the suggestion in https://github.com/microsoft/onnxruntime-genai/pull/710#issue-2415518051, I referred to the ONNX Runtime Build Documentation and followed the steps below:

git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai
curl -L https://github.com/microsoft/onnxruntime/releases/download/v1.19.2/onnxruntime-linux-x64-1.19.2.tgz -o onnxruntime-linux-x64-1.19.2.tgz && \
tar xvzf onnxruntime-linux-x64-1.19.2.tgz && \
mv onnxruntime-linux-x64-1.19.2 ort
python build.py --config Release
python3 builder.py -m microsoft/Phi-3-small-8k-instruct -o phi3_small8k -e cpu -p fp16

However, I encountered the following error: AssertionError: Flash Attention is not available, but is needed for dense attention.

Detailed Trace:

Valid precision + execution provider combinations are: FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML
Extra options: {}
/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:961: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
 warnings.warn(
/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
 warnings.warn(
/proj_sw/user_dev/kkannan/sep20_phi3_setup/onnxruntime-genai/builder.py:2509: UserWarning: Sparse CSR tensor support is in beta state. If you miss a functionality in the sparse tensor support, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at ../aten/src/ATen/SparseCsrTensorImpl.cpp:53.)
 block_mask_dense_output = [xi.to_sparse_csr() for xi in block_mask_dense_output]
/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:469: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
 warnings.warn(
2024-09-20 08:54:16,513 transformers_modules.microsoft.Phi-3-small-8k-instruct.1535ae26fb4faada95c6950e8bc6e867cdad6b00.modeling_phi3_small [INFO] - Layer 2 is using dense attention since it is divisible by 2
Traceback (most recent call last):
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/onnxruntime-genai/builder.py", line 2872, in <module>
  create_model(args.model_name, args.input, args.output, args.precision, args.execution_provider, args.cache_dir, **extra_options)
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/onnxruntime-genai/builder.py", line 2764, in create_model
  onnx_model.make_model(input_path)
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/onnxruntime-genai/builder.py", line 1762, in make_model
  model = AutoModelForCausalLM.from_pretrained(self.model_name_or_path, cache_dir=self.cache_dir, use_auth_token=True, trust_remote_code=True, **extra_kwargs)
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 559, in from_pretrained
  return model_class.from_pretrained(
 File "/proj_sw/user_dev/kkannan/sep20_phi3_setup/kamal_3_10_12/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3832, in from_pretrained
  model = cls(config, *model_args, **model_kwargs)
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 903, in __init__
  self.model = Phi3SmallModel(config)
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 745, in __init__
  self.layers = nn.ModuleList([Phi3SmallDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 745, in <listcomp>
  self.layers = nn.ModuleList([Phi3SmallDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 651, in __init__
  self.self_attn = Phi3SmallSelfAttention(config, layer_idx)
 File "/home/kkannan/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-8k-instruct/1535ae26fb4faada95c6950e8bc6e867cdad6b00/modeling_phi3_small.py", line 218, in __init__
  assert is_flash_attention_available, "Flash Attention is not available, but is needed for dense attention"
AssertionError: Flash Attention is not available, but is needed for dense attention

Note: I verified my build by exporting the Phi-3-mini-4k-instruct model successfully.

Additionally, I want to export the model with input_ids and attention_mask as inputs (without position_ids, present, and past key values) and obtain logits as the output. Is there any way to achieve this?

Any help from members of the official repository would be greatly appreciated!

yufenglee commented 3 days ago

Could you please try: pip install flash-attn

BTW, these are the combinations the tool supports now: FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML

Kamalrajkannan commented 3 days ago

Thanks for the response @yufenglee

To export the Microsoft/Phi-3-small-8k-instruct ONNX model, CUDA is mandatory; we can't export it using CPU (because flash attention needs CUDA). But can we run the resultant model, exported using -e cuda, on the CPU? Is that right? Correct me if I am wrong.

And is there any way to export the model with input_ids and attention_mask as inputs (without position_ids, present, and past key values) and obtain logits only as the output?

yufenglee commented 3 days ago

CUDA is required to export the model for the reason you mentioned. '-e cuda' specifies that the exported ONNX model is targeted to run with the ONNX Runtime CUDA EP. You can use '-e cpu' to export the ONNX model to run with the ORT CPU EP. '-p fp16/fp32/int4' specifies the data type of the ONNX model.
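For example, a CPU-targeted export would be something like this (the output folder name is just an illustration):

python3 builder.py -m microsoft/Phi-3-small-8k-instruct -o phi3_small8k_cpu -e cpu -p int4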

position_ids and the present/past key values are required inputs of the model. We don't have an option to omit them now. However, those inputs/outputs are managed automatically by the ORT GenAI API. You can get logits with the ORT GenAI API like this after you export the model: https://github.com/microsoft/onnxruntime-genai/blob/f5af7634824dd205bb8555b94c770158350bae05/test/python/test_onnxruntime_genai_api.py#L243
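Roughly, getting logits that way looks something like this (the model folder path is a placeholder, and the exact method names may vary slightly between ORT GenAI releases):

import onnxruntime_genai as og

# Load the folder produced by the model builder (placeholder path)
model = og.Model("./phi3_small8k")
tokenizer = og.Tokenizer(model)

# Tokenize a prompt; position_ids and the KV caches are created and managed internally
params = og.GeneratorParams(model)
params.input_ids = tokenizer.encode("Africa is an emerging economy because")
generator = og.Generator(model, params)

# One forward pass, then read the logits output
generator.compute_logits()
logits = generator.get_output("logits")
print(logits.shape)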

Kamalrajkannan commented 3 days ago

Here they mentioned that Phi-3 small ONNX models can now run on CPU. I can't export Phi-3 small using CPU (because flash attention needs CUDA), and I can't run the exported model on CPU if it is targeted to run with the ONNX Runtime CUDA EP. Could you please clarify how we can run the Phi-3 small ONNX model on the CPU? @yufenglee

kunal-vaishnavi commented 3 days ago

As per https://github.com/microsoft/onnxruntime-genai/pull/710#issue-2415518051, I referred to ONNX Runtime Build Documentation and followed the steps below

Since that PR was merged, the changes have been added to the latest versions of ONNX Runtime and ONNX Runtime GenAI. You can install the latest stable versions to produce the Phi-3 small ONNX model for CPU instead of needing to build from source.

Additionally, I want to export the model with input_ids and attention_mask as inputs (without position_ids, present, and past key values) and obtain logits as the output. Is there any way to achieve this?

You can make the following modifications to the model builder to achieve this.

  1. To remove the past and present key-value caches from being added as inputs and outputs to the ONNX model respectively, you can comment out this code block.

https://github.com/microsoft/onnxruntime-genai/blob/f5af7634824dd205bb8555b94c770158350bae05/src/python/py/models/builder.py#L495-L507

  2. To make sure the attention ops do not reference the past and present key-value caches, please set the following variables to empty strings.

https://github.com/microsoft/onnxruntime-genai/blob/f5af7634824dd205bb8555b94c770158350bae05/src/python/py/models/builder.py#L1378-L1381

  3. Position ids are added into the graph as an input when the RotaryEmbedding op needs to be created. The op is not created with FP32 CPU and INT4 CPU, so those configurations will produce an ONNX model that does not contain a position_ids input.

As mentioned above, however, the past and present key-value caches are required to run with ONNX Runtime GenAI.

AssertionError: Flash Attention is not available, but is needed for dense attention.

As mentioned above, pip install flash-attn should resolve this issue. The original Phi-3 small modeling file checks if the flash-attn package is installed because the package is required to run Phi-3 small with PyTorch. However, since the model builder only loads the model weights into memory and does not run the model, you do not need to have flash-attn installed to get the ONNX model.

Here's how you can get around this issue.

  1. Clone the repo with the Phi-3 small model that you wish to use (for example, the Phi-3 small 8K repo).
  2. Comment out the assert for flash attention (for example, this line in the Phi-3 small 8K modeling file).
  3. Run the model builder as python3 builder.py -i /path/to/repo/you/cloned/ -o /path/to/output/folder/ -p {int4 or fp32} -e cpu. The -i flag loads the model from a local folder where you commented out the assert, whereas the -m flag would download the model from Hugging Face and use the already-uploaded files in which the assert is not commented out. A concrete command sequence is sketched after this list.
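For example, the whole workaround could look something like this (local folder names are just placeholders, and git-lfs is needed to pull the weight files):

git clone https://huggingface.co/microsoft/Phi-3-small-8k-instruct
# comment out the flash-attention assert in Phi-3-small-8k-instruct/modeling_phi3_small.py
python3 builder.py -i ./Phi-3-small-8k-instruct -o ./phi3_small8k_cpu -p int4 -e cpu
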
Kamalrajkannan commented 13 hours ago

Thanks for the detailed explanation @kunal-vaishnavi. I followed the suggested steps. In addition, I changed the opset version to 17 in builder.py because while using onnx.checker.check_model on the exported model it throws the error below:

Traceback (most recent call last):
  File "/proj_sw/user_dev/kkannan/sep21_phi3/check.py", line 84, in <module>
    export_torch_to_onnx_phi3()
  File "/proj_sw/user_dev/kkannan/sep21_phi3/check.py", line 62, in export_torch_to_onnx_phi3
    onnx.checker.check_model(model_file)
  File "/proj_sw/user_dev/kkannan/sep21_phi3/my_env/lib/python3.10/site-packages/onnx/checker.py", line 163, in check_model
    C.check_model_path(
onnx.onnx_cpp2py_export.checker.ValidationError: No Op registered for LayerNormalization with domain_version of 14

==> Context: Bad node spec for node. Name: /model/layers.0/input_layernorm/LayerNorm OpType: LayerNormalization

My aim is to perform inference using the below script:

import os
from transformers import AutoTokenizer
import onnx
import onnxruntime
def export_torch_to_onnx_phi3():

    variant = "microsoft/Phi-3-small-8k-instruct"
    tokenizer = AutoTokenizer.from_pretrained(variant, return_tensors="pt", trust_remote_code=True)
    input_prompt = "Africa is an emerging economy because"
    inputs = tokenizer(
        input_prompt,
        return_tensors="pt",
        max_length=256,
        pad_to_max_length=True,
        truncation=True,
    )

    input_ids = inputs["input_ids"]
    attn_mask = inputs["attention_mask"]
    model_file = "model_2/model.onnx"

    # onnx_model = onnx.load()
    onnx.checker.check_model(model_file)
    print("check passed")

    ort_session = onnxruntime.InferenceSession(model_file)
    onnx_input = {"input_ids": input_ids.numpy(), "attention_mask": attn_mask.numpy()}
    onnx_output = ort_session.run(None, onnx_input)

if __name__ == "__main__":
    export_torch_to_onnx_phi3()

But this script throws the below error. Is there any additional modification needed in the model script?

Traceback (most recent call last):
  File "/proj_sw/user_dev/kkannan/sep21_phi3/s1.py", line 34, in <module>
    export_torch_to_onnx_phi3()
  File "/proj_sw/user_dev/kkannan/sep21_phi3/s1.py", line 29, in export_torch_to_onnx_phi3
    ort_session = onnxruntime.InferenceSession(model_file)
  File "/proj_sw/user_dev/kkannan/sep21_phi3/my_env/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/proj_sw/user_dev/kkannan/sep21_phi3/my_env/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 480, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from model_2/model.onnx failed:This is an invalid model. In Node, ("/model/layers.0/attn/SparseAttention", SparseAttention, "com.microsoft", -1) : ("/model/layers.0/attn/qkv_proj/Add/output_0": tensor(float),"","","","","block_row_indices": tensor(int32),"block_col_indices": tensor(int32),"/model/attn_mask_reformat/attn_mask_subgraph/Gather/Cast/output_0": tensor(int32),"/model/attn_mask_reformat/attn_mask_subgraph/ReduceSum/Cast/output_0": tensor(int32),"cos_cache": tensor(float),"sin_cache": tensor(float),) -> ("/model/layers.0/attn/SparseAttention/output_0","","",) , Error Node (/model/layers.0/attn/SparseAttention)'s input 3 is marked single but has an empty string in the graph
kunal-vaishnavi commented 10 hours ago

I followed the suggested steps. In addition, I changed the opset version to 17 in builder.py because while using onnx.checker.check_model on the exported model it throws the error below.

This is an expected error no matter which opset version is used. The generated ONNX models contain operators that are in the ai.onnx domain and com.microsoft domain for optimized performance with ONNX Runtime. The ONNX checker works for models that contain operators only in the ai.onnx domain.

My aim is to perform inference using the below script but this script throws the below error. Is there any additional modification needed in the model script?

According to the op schema for SparseAttention, the past_key and past_value inputs as well as the present_key and present_value outputs are required since they are not marked with OpSchema::Optional.

To get a valid ONNX model, you will need to undo the model builder changes you made so that the past and present key-value caches are added back as inputs and outputs to both the ONNX model and the SparseAttention op. Once you have undone those changes and generated a new ONNX model with the model builder, you can add the following changes to your inference script.

# New imports to add
import numpy as np
from transformers import AutoConfig

# After your `variant = "microsoft/Phi-3-small-8k-instruct"` line
config = AutoConfig.from_pretrained(variant, trust_remote_code=True, cache_dir="./cache_dir")
np_dtype = np.float32

# After your `input_ids = inputs["input_ids"]` line
batch_size = input_ids.shape[0]
num_kv_heads = config.num_key_value_heads
past_seq_len = 0
head_size = config.hidden_size // config.num_attention_heads

# After your `onnx_input = {"input_ids": input_ids.numpy(), "attention_mask": attn_mask.numpy()}` line
empty_past_kv = np.zeros((batch_size, num_kv_heads, past_seq_len, head_size), dtype=np_dtype)
onnx_input.update({f"past_key_values.{i}.key": empty_past_kv for i in range(config.num_hidden_layers})
onnx_input.update({f"past_key_values.{i}.value": empty_past_kv for i in range(config.num_hidden_layers})

# After your `onnx_output = ort_session.run(None, onnx_input)` line
logits = onnx_output[0]

Given that you will need the past and present key-value caches for the ONNX model and from reading your inference script, it appears you can use ONNX Runtime GenAI to simplify your inference script and improve model performance. Here is an example inference script for Phi-3 that applies the Phi-3 chat template. For a more general-purpose and simpler inference script, here is another example.

You can also swap out ONNX Runtime GenAI's tokenizer with Hugging Face's tokenizer in these examples if you want. For the Phi-3 specific inference script, you can set params.input_ids to the input_ids you get from Hugging Face and replace tokenizer_stream.decode(new_token) with tokenizer.batch_decode([new_token]) from Hugging Face. For the generic inference script, you can set params.input_ids to the input_ids you get from Hugging Face and replace this loop with tokenizer.batch_decode from Hugging Face.
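As a rough sketch of that tokenizer swap in a generation loop (the model folder path is a placeholder, and the method names are assumptions based on the ORT GenAI Python API current at the time):

import numpy as np
import onnxruntime_genai as og
from transformers import AutoTokenizer

# Hugging Face tokenizer + ORT GenAI generation loop
hf_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-small-8k-instruct", trust_remote_code=True)
model = og.Model("./phi3_small8k")  # placeholder path to the exported model folder

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
# Feed Hugging Face input_ids instead of og.Tokenizer's encoding
params.input_ids = np.array(hf_tokenizer("Africa is an emerging economy because")["input_ids"], dtype=np.int32)
generator = og.Generator(model, params)

text = ""
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    # Hugging Face decoding in place of tokenizer_stream.decode(new_token)
    text += hf_tokenizer.batch_decode([new_token])[0]
print(text)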

Kamalrajkannan commented 6 hours ago

Got it, thanks for the suggestions @kunal-vaishnavi.

When past_seq_len = 0, the model faced:

onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running SparseAttention node. Name:'/model/layers.0/attn/SparseAttention' Status Message: max_cache_sequence_length should be no less than total_sequence_length:256, max_cache_sequence_length:0

When past_seq_len >= 256, the model faced:
/proj_sw/user_dev/kkannan/sep21_phi3/my_env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2870: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
  warnings.warn(
2024-09-23 20:23:19.353896258 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running SparseAttention node. Name:'/model/layers.0/attn/SparseAttention' Status Message: /onnxruntime_src/onnxruntime/contrib_ops/cpu/sparse/sparse_attention.cc:105 onnxruntime::common::Status onnxruntime::contrib::SparseAttention<T>::Compute(onnxruntime::OpKernelContext*) const [with T = float] past_key->DataRaw() == present_key->DataRaw() && past_value->DataRaw() == present_value->DataRaw() was false. 

Traceback (most recent call last):
  File "/proj_sw/user_dev/kkannan/sep21_phi3/n1_e.py", line 50, in <module>
    export_torch_to_onnx_phi3()
  File "/proj_sw/user_dev/kkannan/sep21_phi3/n1_e.py", line 45, in export_torch_to_onnx_phi3
    onnx_output = ort_session.run(None, onnx_input)
  File "/proj_sw/user_dev/kkannan/sep21_phi3/my_env/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running SparseAttention node. Name:'/model/layers.0/attn/SparseAttention' Status Message: /onnxruntime_src/onnxruntime/contrib_ops/cpu/sparse/sparse_attention.cc:105 onnxruntime::common::Status onnxruntime::contrib::SparseAttention<T>::Compute(onnxruntime::OpKernelContext*) const [with T = float] past_key->DataRaw() == present_key->DataRaw() && past_value->DataRaw() == present_value->DataRaw() was false. 
# inference script for reference 

import os
from transformers import AutoTokenizer
import onnx
import onnxruntime
import numpy as np
from transformers import AutoConfig

def export_torch_to_onnx_phi3():

    variant = "microsoft/Phi-3-small-8k-instruct"
    tokenizer = AutoTokenizer.from_pretrained(variant, return_tensors="pt", trust_remote_code=True)
    config = AutoConfig.from_pretrained(variant, trust_remote_code=True)
    np_dtype = np.float32

    input_prompt = "Africa is an emerging economy because"
    inputs = tokenizer(
        input_prompt,
        return_tensors="pt",
        max_length=256,
        pad_to_max_length=True,
        truncation=True,
    )

    input_ids = inputs["input_ids"]
    batch_size = input_ids.shape[0]
    num_kv_heads = config.num_key_value_heads
    past_seq_len = 300
    head_size = config.hidden_size // config.num_attention_heads
    attn_mask = inputs["attention_mask"]
    model_file = "new_2/model.onnx"

    onnx_model = onnx.load(model_file)
    onnx.checker.check_model(model_file)
    print("check passed")

    ort_session = onnxruntime.InferenceSession(model_file)
    onnx_input = {"input_ids": input_ids.numpy(), "attention_mask": attn_mask.numpy()}
    empty_past_kv = np.zeros((batch_size, num_kv_heads, past_seq_len, head_size), dtype=np_dtype)
    onnx_input.update({f"past_key_values.{i}.key": empty_past_kv for i in range(config.num_hidden_layers)})
    onnx_input.update({f"past_key_values.{i}.value": empty_past_kv for i in range(config.num_hidden_layers)})

    print("onnx_input",onnx_input)
    onnx_output = ort_session.run(None, onnx_input)
    print("onnx_output",onnx_output)

if __name__ == "__main__":
    export_torch_to_onnx_phi3()

So the conclusion is that, on CPU, exporting Phi-3 small variants to an ONNX model without past and present keys and values is not valid, as those are mandatory inputs to SparseAttention.

kunal-vaishnavi commented 5 hours ago

As layer normalization is available from opset 17, I changed the opset version from 14 (the default) to 17. The resultant model didn't face the Context: Bad node spec for node. Name: /model/layers.0/input_layernorm/LayerNorm OpType: LayerNormalization error while checking with onnx.checker.check_model(model_file). That's why I mentioned it.

Yes, the LayerNorm-specific error will go away with opset 17 or higher. But since the ONNX model from the model builder always has ops from the com.microsoft domain, the ONNX checker will eventually fail and raise a different error.
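If you just want a quick load-time sanity check, one option (my suggestion, not part of the model builder itself) is to create an ONNX Runtime InferenceSession instead of calling onnx.checker, since ORT registers the com.microsoft contrib ops that the checker does not know about:

import onnxruntime

# Session creation validates the graph against ORT's own op registry,
# including the com.microsoft contrib ops (placeholder model path)
sess = onnxruntime.InferenceSession("/path/to/model.onnx", providers=["CPUExecutionProvider"])
print([i.name for i in sess.get_inputs()])
print([o.name for o in sess.get_outputs()])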

When past_seq_len = 0, the model faced the max_cache_sequence_length error; when past_seq_len >= 256, the model faced the past/present buffer-sharing error.

The SparseAttention op requires past-present buffer sharing for it to work. You will need to use ONNX Runtime's IO Binding to bind the data pointers for the pre-allocated key-value caches to the past and present key-value cache inputs and outputs of the ONNX model. This will essentially update the key-value caches in-place. Here's how you can do that.

import onnxruntime as ort
from onnxruntime import OrtValue

# Pre-allocate the max memory used for each KV cache separately instead of using `empty_past_kv`
max_seq_len = 256
kv_caches = [np.zeros((batch_size, num_kv_heads, max_seq_len, head_size), dtype=np_dtype) for _ in range(2 * config.num_hidden_layers)]

# Create IO binding object from session
sess = ort.InferenceSession("/path/to/model.onnx")
io_binding = sess.io_binding()

# Bind input and output data pointers to IO binding object
input_names = list(map(lambda i: i.name, sess.get_inputs()))
for i, input_name in enumerate(input_names):
    np_data = None
    if input_name == "input_ids":
        np_data = input_ids.numpy()  # convert the PyTorch tensor from the tokenizer to NumPy
    elif input_name == "attention_mask":
        np_data = attn_mask.numpy()
    else:
        # Subtract 2 since order of `input_names` will be `input_ids`, `attention_mask`, `past_key_values.0.key`, ...
        np_data = kv_caches[i - 2]

    # Create OrtValue from NumPy array and bind OrtValue as input
    ort_value = OrtValue.ortvalue_from_numpy(np_data, device_type="cpu", device_id=0)
    io_binding.bind_ortvalue_input(input_name, ort_value)

    if input_name != "input_ids" and input_name != "attention_mask":
        # Bind same data pointer for KV caches as output to update KV caches in-place
        output_name = input_name.replace("past_key_values", "present")
        io_binding.bind_ortvalue_output(output_name, ort_value)

# Bind `logits` output to CPU
io_binding.bind_output("logits", device_type="cpu", device_id=0)

# Run inference and get logits
sess.run_with_iobinding(io_binding)
# `get_outputs()` returns outputs in the order they were bound, so `logits` (bound last) is the final entry
logits = io_binding.get_outputs()[-1].numpy()