onnx / tensorflow-onnx

Convert TensorFlow, Keras, TensorFlow.js and TFLite models to ONNX
Apache License 2.0

[ONNXRuntimeError] : 10 : INVALID_GRAPH : This is an invalid model. Error in Node:model/multi_category_encoding/AsString : No Op registered for AsString with domain_version of 9 #1645

Closed · hanzigs closed this 3 years ago

hanzigs commented 3 years ago

The code below works perfectly when run in a Python file (python==3.9.5, tensorflow==2.5.0, keras2onnx==1.7.0, onnxruntime==1.8.0, keras==2.4.3, tf2onnx==1.9.1):

import tf2onnx
import onnxruntime
from autokeras import StructuredDataClassifier

# Train an AutoKeras structured-data classifier and export the best Keras model
autoKeras_model = StructuredDataClassifier(max_trials=MaxTrials)
autoKeras_model.fit(x=X_train, y=y_train, validation_data=(X_valid, y_valid), epochs=Epochs, verbose=1)
ExportedautoKeras_model = autoKeras_model.export_model()

# Convert the exported Keras model to ONNX and create an inference session
onnx_model, _ = tf2onnx.convert.from_keras(ExportedautoKeras_model)
content = onnx_model.SerializeToString()
sess = onnxruntime.InferenceSession(content)

The same code inside a Flask app fails at InferenceSession with this error:

sess = onnxruntime.InferenceSession(content)

  File "C:\Users\plg\Anaconda3\envs\automl04augpy395elk7120\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 283, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Users\plg\Anaconda3\envs\automl04augpy395elk7120\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 312, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_bytes, False, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : This is an invalid model. Error in Node:model/multi_category_encoding/AsString : No Op registered for AsString with domain_version of 9
I am mainly after the input_name.
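
For reference, once the session loads, the input name can be read off the session itself; a minimal sketch using the standard ONNX Runtime API:

input_name = sess.get_inputs()[0].name  # name of the first graph input
print(input_name)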

If that's a converter bug, how should I find the correct opset? (I have tried opsets 9 through 13; all throw errors.) And why is that error not raised in the standalone run?
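
For reference, the opset is pinned via the converter's keyword argument; a minimal sketch (13 is just an example value):

# the opset can be set explicitly when converting
onnx_model, _ = tf2onnx.convert.from_keras(ExportedautoKeras_model, opset=13)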

Any help please? Thanks

hanzigs commented 3 years ago

OK, will try to find that and get back

hanzigs commented 3 years ago

Are these the ones? image

TomWildenhain-Microsoft commented 3 years ago

Those are output nodes. They are not tables. I might be able to improve our table loading. Give me 45 minutes.

hanzigs commented 3 years ago

Here are some error lines:

image

image

hanzigs commented 3 years ago

Another one: image

image

TomWildenhain-Microsoft commented 3 years ago

Unfortunately what I tried didn't work. I still need either a shared_name or resource_handle for the tensorflow backend to give me the values stored in the table. The keras model must have them somewhere because it needs to pull those values to save the model. The existing search code actually traces the model save code to find the resources, but they must be overriding the defaults or something. Hard for me to say without seeing the model.

hanzigs commented 3 years ago

The key value for the search should be "resource_handle", is that right?

TomWildenhain-Microsoft commented 3 years ago

The key value for the search should be "resource_handle", is that right?

Or shared_name. Actually shared_name is slightly better. Ideally we want both. They are normally next to each other, and they might start with an underscore prefix ("_shared_name").

I've got another idea that I actually think will work but it's late so I'll try it tomorrow/Monday.

hanzigs commented 3 years ago

I found these few things: there are _shared_name and _handle_name attributes, and each of the _undeduplicated_weights has different names for each.

image

image

If this is correct, I will try it on Monday.
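
Based on that, a hypothetical probe for those attributes might look like the following (attribute names and availability may vary by TF/Keras version):

# hypothetical probe: list the shared/handle names on the undeduplicated weights
for w in getattr(ExportedautoKeras_model, "_undeduplicated_weights", []):
    print(getattr(w, "_shared_name", None), getattr(w, "_handle_name", None))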

TomWildenhain-Microsoft commented 3 years ago

#1654 will hopefully fix it automatically. Try it with:

pip uninstall tf2onnx
pip install git+https://github.com/onnx/tensorflow-onnx@tom/keras_hash_tables

The improved method grabs the resource handle from graph captures and makes up a shared_name if it fails to find one in the model.
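
Roughly, the idea looks like this (a sketch under assumptions, not the actual tf2onnx code; the input spec here is made up):

import tensorflow as tf

# Sketch: a traced function's graph records the resource tensors it captures,
# which is where a table handle can be recovered even without a shared_name.
spec = tf.TensorSpec([None, 10], tf.float32)  # assumed input signature
concrete = tf.function(ExportedautoKeras_model).get_concrete_function(spec)
for external, internal in concrete.graph.captures:
    if external.dtype == tf.resource:
        print(internal.name)  # placeholders for captured tables/variables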

hanzigs commented 3 years ago

That was perfect, it worked as expected. Thank you very much; the support is greatly appreciated.

TomWildenhain-Microsoft commented 3 years ago

Excellent. And @hanzigs, have you confirmed that the model loads in ONNX Runtime and produces correct results when run on the validation data?
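
For instance, a minimal parity check might look like this (assuming a single-input graph and numpy validation data; read the real input name and dtype from the session):

import numpy as np

# compare Keras and ONNX Runtime predictions on the same validation batch
keras_preds = ExportedautoKeras_model.predict(X_valid)
inp = sess.get_inputs()[0]
onnx_preds = sess.run(None, {inp.name: X_valid.astype(np.float32)})[0]
print(np.max(np.abs(keras_preds - onnx_preds)))  # near 0 if the models agree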

hanzigs commented 3 years ago

Yes, the Python model and the ONNX model reproduce the same expected results on the validation data.

image

TomWildenhain-Microsoft commented 3 years ago

Awesome! Thanks for helping us debug this.

hanzigs commented 3 years ago

Please let me know when this is available via pip install tf2onnx.

hanzigs commented 3 years ago

May I know whether these warnings affect anything?

INFO:tf2onnx.tfonnx:Using tensorflow=2.5.0, onnx=1.10.0, tf2onnx=1.10.0/32d758
INFO:tf2onnx.tfonnx:Using opset <onnx, 9>
WARNING:tf2onnx.shape_inference:Cannot infer shape for model/multi_category_encoding/string_lookup_1/None_lookup_table_find/LookupTableFindV2: model/multi_category_encoding/string_lookup_1/None_lookup_table_find/LookupTableFindV2:0
WARNING:tf2onnx.shape_inference:Cannot infer shape for model/multi_category_encoding/Cast_1: model/multi_category_encoding/Cast_1:0
INFO:tf2onnx.tf_utils:Computed 0 values for constant folding
WARNING:tf2onnx.onnx_opset.tensor:ONNX does not support precision, scientific and fill attributes for AsString
INFO:tf2onnx.optimizer:Optimizing ONNX model
INFO:tf2onnx.optimizer:After optimization: Const -20 (29->9), Identity -2 (2->0)

Sorry, there is a difference in the prediction results between the Python and ONNX models from Flask, but in a Python file both produce the same results. The above are the only warnings; no errors happen.

TomWildenhain-Microsoft commented 3 years ago

Awwww, I thought we had fixed this. The models from Python and Flask are different. Keep in mind that AutoKeras can choose very different model architectures depending on the data it is given. I think it is likely you are training the Python and Flask models on different data.

The "does not support precision, scientific and fill attributes for AsString" might or might not matter depending on how the lookup table is formatted. Can you upload the converted onnx model from flask again?

guschmue commented 3 years ago

The warning means we can't handle all attributes of AsString(), i.e. instead of float 123. ONNX would have float 123.000000. Not sure if it hurts in this case; it might if a CategoryMapper is behind it, because the lookup table would have the TF representation. int32 and int64 should be OK; float might run into issues. Not sure when AutoKeras starts using AsString(); for the examples I tried, it always used String_To_Number().
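
To illustrate (a standalone sketch using TF's public wrapper for the AsString op; exact output depends on the attributes):

import tensorflow as tf

# the precision/scientific/shortest attributes of AsString have no
# equivalent on an ONNX Cast, so the string renderings can differ
x = tf.constant([123.0])
print(tf.strings.as_string(x, precision=6).numpy())    # e.g. [b'123.000000']
print(tf.strings.as_string(x, shortest=True).numpy())  # e.g. [b'123']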

hanzigs commented 3 years ago

Thanks for that; I have uploaded the converted ONNX model to the drive.

Regarding the different results: actually, I meant that the Python model in Flask and the same model converted to ONNX in Flask give different prediction results.

But if I build the model in a Python file and convert it to ONNX, those prediction results are the same. I'm very confused why that is.

TomWildenhain-Microsoft commented 3 years ago

Ah, do you get the same results between the Flask Keras model and the Flask ONNX model?

I think you are almost certainly getting different models in Flask and Python. I'm not sure why, but I suspect you are giving AutoKeras different input data or different args.

hanzigs commented 3 years ago

Yeah, here:

onnx_model, _ = tf2onnx.convert.from_keras(model)

This is inside Flask, where model is a Python object. The model's prediction result is 0.60987216 and the onnx_model result is 0.5559953 for the same test data.

hanzigs commented 3 years ago

Not sure whether those warnings make any difference

TomWildenhain-Microsoft commented 3 years ago

The category mapper in the model looks like: image

That's likely not going to work. That said, I find the whole thing a little strange, since the result is immediately cast back to float. It seems highly unlikely that this lookup table is useful. @hanzigs, are you using real testing data on this? Do you find that the TF model produces useful results on non-training data?

Also, are you certain you are running the Python script with the same data, args, and virtual environment as the Flask app?

hanzigs commented 3 years ago

Regarding same data, args, and virtual environment: yes, I'm sure about that, because this Flask app has got 7 models (Keras sequential, LightGBM, XGBoost, random forest, extra trees, decision tree, and AutoKeras). The other 6 models all work perfectly, with the data and args passed the same way, so I'm sure those are correct in the Flask app.

Regarding the cast back to float: I'm not sure what that is, but the testing data is correct.

May I understand what the problem with the above would be, please?

hanzigs commented 3 years ago

There is no complex code happening in autokeras:

    akmodel = StructuredDataClassifier(max_trials=AK_Hyperparameters['max_trials'])
    akmodel.fit(x=X_train, y=y_train, validation_data=(X_valid, y_valid), epochs=AK_Hyperparameters['epochs'])
    autoKeras_model = akmodel.export_model()

hanzigs commented 3 years ago

Category fields are normalized using a WoE (weight of evidence) transformation, and numeric fields are normalized using MinMaxScaler, separately. Is this an issue?

hanzigs commented 3 years ago

The testing data transformation follows the same normalization steps for prediction.
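
For context, a minimal sketch of the numeric side of that pipeline (numeric_cols is an assumed column list; the WoE step is our own code):

from sklearn.preprocessing import MinMaxScaler

# fit the scaler on training data only, then reuse the same fitted
# transform on the test data at prediction time
scaler = MinMaxScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])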

hanzigs commented 3 years ago

The Flask app is a bit complex to cut down to a miniature version, because it is linked to an Elasticsearch database at each and every step; that's why I can't send the Flask app code.

TomWildenhain-Microsoft commented 3 years ago

The issue is that the input to the CategoryMapper (lookup table) comes from an AsString op in TF, which converts a number to a string. There is no corresponding op in ONNX, so we convert it to a Cast, but that won't necessarily use the same precision: 0.0 becomes "0.0", not "0.00000". The lookup for 0.0 will then return 1, not 5, and the results may differ. If we were doing int to string it would be consistent, but float to string is more problematic.
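
A toy illustration of the failure mode (plain Python, with made-up table values):

# a lookup table keyed on one string rendering of floats misses when another
# runtime renders floats differently; the default value is returned instead
table = {"0.00000": 5, "1.50000": 7}  # keys as TF's AsString might emit them
default_value = 1

def lookup(value, fmt):
    return table.get(fmt % value, default_value)

print(lookup(0.0, "%.5f"))  # 5: matching format hits the table entry
print(lookup(0.0, "%.1f"))  # 1: "0.0" misses the table, default returned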

TomWildenhain-Microsoft commented 3 years ago

The Flask app is a bit complex to cut down to a miniature version, because it is linked to an Elasticsearch database at each and every step; that's why I can't send the Flask app code.

Is the data from the database used to train the autokeras model?

hanzigs commented 3 years ago

Yes it is

TomWildenhain-Microsoft commented 3 years ago

How does the python script get the data then? What data does it use?

hanzigs commented 3 years ago

I use the Python Elasticsearch client to pull the data.

TomWildenhain-Microsoft commented 3 years ago

Can you pickle the data from each and compare that they are identical?
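
For example (a rough sketch; the file names and the pandas assumption are mine):

import pickle

# in each environment, dump the training data right before calling fit
# (use a different file name per environment)
with open("flask_train.pkl", "wb") as f:
    pickle.dump((X_train, y_train), f)

# later, load both dumps in one session and compare them
flask_X, flask_y = pickle.load(open("flask_train.pkl", "rb"))
script_X, script_y = pickle.load(open("script_train.pkl", "rb"))
print(flask_X.equals(script_X) and flask_y.equals(script_y))  # pandas objects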

hanzigs commented 3 years ago

Yes, I can.

hanzigs commented 3 years ago

The prediction testing happens from Postman.

hanzigs commented 3 years ago

The issue is that the input to the CategoryMapper (lookup table) comes from an AsString op in TF, which converts a number to a string. There is no corresponding op in ONNX, so we convert it to a Cast, but that won't necessarily use the same precision: 0.0 becomes "0.0", not "0.00000". The lookup for 0.0 will then return 1, not 5, and the results may differ. If we were doing int to string it would be consistent, but float to string is more problematic.

But if I build the model step by step in a Python file, calling only functions of the Flask app, and test it, it works fine.

hanzigs commented 3 years ago

Anyway, I will check that. Thanks for the support, much appreciated. You can close this ticket.

hanzigs commented 3 years ago

Hi @TomWildenhain-Microsoft, I have uploaded 4 models to the drive. Of those, 3 models are named tf2onnx.... Is it possible to check whether the CategoryMapper with the cast back to float is in those models? All of these models give perfect results, but they were built from a Python file using functions from the Flask app.

The 4th model, which has the CategoryMapper and was built from the Flask app, does not give correct results. Thanks

hanzigs commented 3 years ago

I visualized the models in Netron and couldn't find a CategoryMapper in the 3 models: image. Not sure why it's not there.

The model created with Flask has it: image image

Not sure what difference the Flask app makes, or why the CategoryMapper shows up in the Flask model but not in the Python-file model.

hanzigs commented 3 years ago

Hi @TomWildenhain-Microsoft, is the 'tom/keras_hash_tables' branch no longer available for installation? Thanks

hanzigs commented 3 years ago

Hi, I have added a Colab notebook with the data and the AutoKeras model building, showing the prediction difference (shared in the drive): https://colab.research.google.com/drive/1DqlJgGZuKf5nev9G6Do7DYEMEU4aAQhy

https://drive.google.com/drive/folders/1HfB00dOuk-awSmIrSg92hmJFYzTpQNCr?usp=sharing

Let me know whether it's possible to convert. Thanks