microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Intel oneDNN #20208

Open ste-q opened 5 months ago

ste-q commented 5 months ago

Describe the issue

I quantized the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model using the script below.

import torch
import torch.nn as nn
import transformers
from onnxruntime.quantization import quantize_dynamic
from sentence_transformers import SentenceTransformer


class PytorchModel(nn.Module):
    def __init__(self, model_path: str):
        super().__init__()
        self.model = SentenceTransformer(model_path)

    def forward(
        self, input_ids: torch.Tensor, attention_mask: torch.Tensor
    ) -> torch.Tensor:
        features = {}
        features["input_ids"] = input_ids
        features["attention_mask"] = attention_mask
        output = self.model(features)
        return output["sentence_embedding"]


py_model = PytorchModel(model_path)
py_model.eval()
tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_path)
inputs = tokenizer.prepare_for_model(
    tokenizer.convert_tokens_to_ids(tokenizer.tokenize("i am fine"))
)
attention_mask = torch.tensor([inputs["attention_mask"]])
input_ids = torch.tensor([inputs["input_ids"]])

# Export to ONNX with dynamic batch and sequence-length dimensions.
torch.onnx.export(
    py_model,
    (input_ids, attention_mask),
    "bert.onnx",
    opset_version=13,
    do_constant_folding=True,
    input_names=["input_ids", "attention_mask"],
    output_names=["output"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sentence_length"},
        "attention_mask": {0: "batch_size", 1: "sentence_length"},
    },
)

# Dynamically quantize the exported model.
onnx_model_path = "bert.onnx"
quantized_model_path = "bert_quantized.onnx"
quantize_dynamic(
    model_input=onnx_model_path,
    model_output=quantized_model_path,
)

I built ONNX Runtime from source using the command below.

./build.sh --config RelWithDebInfo --build_shared_lib --parallel --enable_training --skip_tests  --build_java --use_dnnl

Whenever I run inference on the quantized model in Java with the oneDNN execution provider, I get the error below.

2024-04-05 12:37:23.831441312 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running DNNL_9692988425953928956_1 node. Name:'DnnlExecutionProvider_DNNL_9692988425953928956_1_1' Status Message: /onnxruntime/onnxruntime/core/providers/dnnl/subgraph/dnnl_dequantizelinear.cc:191 void onnxruntime::ort_dnnl::DnnlDequantizeLinear::ValidateDims(onnxruntime::ort_dnnl::DnnlSubgraphPrimitive&, onnxruntime::ort_dnnl::DnnlNode&) x_scale and x_zero_point dimensions does not match

Please note that when I remove options.addDnnl(true); from the session options, the same model and script work fine. Running the unquantized ONNX model also works fine.

The issue only occurs when I run inference with different inputs across calls. For example, if I send the input "test" on the first inference, I receive the corresponding embedding. However, on the second call, when I pass any input other than "test", the error above is raised.

To reproduce

Please find the models and Jar here
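
In case it helps, here is a minimal sketch of the Java inference path that triggers the failure. The token-ID arrays and file name are placeholder values and the class is illustrative only; the actual reproduction uses the tokenizer and model from the attachment above.

import ai.onnxruntime.*;
import java.util.Arrays;
import java.util.Map;

public class DnnlRepro {
    public static void main(String[] args) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        OrtSession.SessionOptions options = new OrtSession.SessionOptions();
        options.addDnnl(true);  // removing this line makes both calls succeed
        try (OrtSession session = env.createSession("bert_quantized.onnx", options)) {
            // First call: returns the expected embedding.
            run(env, session, new long[]{101, 3231, 102});             // placeholder token IDs for "test"
            // Second call with a different input / sequence length: fails with
            // "x_scale and x_zero_point dimensions does not match".
            run(env, session, new long[]{101, 1045, 2572, 2986, 102}); // placeholder token IDs for "i am fine"
        }
    }

    static void run(OrtEnvironment env, OrtSession session, long[] ids) throws OrtException {
        long[] mask = new long[ids.length];
        Arrays.fill(mask, 1L);
        try (OnnxTensor inputIds = OnnxTensor.createTensor(env, new long[][]{ids});
             OnnxTensor attentionMask = OnnxTensor.createTensor(env, new long[][]{mask});
             OrtSession.Result result = session.run(
                     Map.of("input_ids", inputIds, "attention_mask", attentionMask))) {
            float[][] embedding = (float[][]) result.get(0).getValue();
            System.out.println("embedding length: " + embedding[0].length);
        }
    }
}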

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04.3

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.18.0

ONNX Runtime API

Java

Architecture

X64

Execution Provider

oneDNN

Execution Provider Library Version

No response

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.