microsoft / onnxruntime-extensions

onnxruntime-extensions: A specialized pre- and post-processing library for ONNX Runtime
MIT License

[WordTokenizer] SIGSEGV when testing example code for WordTokenizer #660

Closed HadiSDev closed 4 months ago

HadiSDev commented 4 months ago

I am trying out the WordTokenizer example from the docs, but for some reason I get a SIGSEGV when testing the saved model. I'm not sure what I did wrong.

import onnx.helper as helper
import onnx
import onnxruntime as ort
import json
from onnxruntime_extensions import get_library_path as _lib_path

so = ort.SessionOptions()
so.register_custom_ops_library(_lib_path())

words = ["want", "##want",
         "##ed", "wa", "un", "runn", "##ing"]
vocab = {w: i + 10 for i, w in enumerate(words)}
st = json.dumps(vocab)
nodes = []
mkv = helper.make_tensor_value_info
reg = helper.make_tensor(
    "pattern", onnx.TensorProto.STRING, [1, ], ["(\\s)".encode('ascii')])
reg_empty = helper.make_tensor(
    "keep_pattern", onnx.TensorProto.STRING, [0, ], [])

nodes = [
    helper.make_node(
        'StringRegexSplitWithOffsets',
        inputs=['text', 'pattern', 'keep_pattern'],
        outputs=['words', 'begin_end', 'indices'],
        name='StringRegexSplitOpName',
        domain='ai.onnx.contrib'),
    helper.make_node(
        'WordpieceTokenizer',
        inputs=['words', 'indices'],
        outputs=['out0', 'out1', 'out2'],
        name='WordpieceTokenizerOpName',
        domain='ai.onnx.contrib',
        vocab=st.encode('utf-8'),
        suffix_indicator="##",
        unknown_token="[UNK]")
]
inputs = [mkv('text', onnx.TensorProto.STRING, [None])]
graph = helper.make_graph(
    nodes, 'test0', inputs, [
        mkv('out0', onnx.TensorProto.STRING, [None]),
        mkv('out1', onnx.TensorProto.INT64, [None]),
        mkv('out2', onnx.TensorProto.INT64, [None]),
        mkv('words', onnx.TensorProto.STRING, [None]),
        mkv('indices', onnx.TensorProto.INT64, [None])],
    [reg, reg_empty])
model = helper.make_model(
    graph, opset_imports=[helper.make_operatorsetid("ai.onnx.contrib", 1)])

model = onnx.shape_inference.infer_shapes(model)
onnx.save_model(model, "model.onnx")

session = ort.InferenceSession("model.onnx", sess_options=so)
session.run(None, {"text": ["HELLO", "HI", "WORLD"]})

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

HadiSDev commented 4 months ago

onnxruntime = "^1.17.0"
onnxruntime-extensions = "^0.10.0"
onnx = "1.15.0"

HadiSDev commented 4 months ago

I think the error is with the StringRegexSplitWithOffsets

HadiSDev commented 4 months ago

The documentation is out of date. I found out that both operators have 4 outputs, not 3.