onnx / tensorflow-onnx

Convert TensorFlow, Keras, TensorFlow.js and TFLite models to ONNX
Apache License 2.0

Problem converting Tensorflow checkpoint created from library #804

Closed SestoAle closed 3 years ago

SestoAle commented 4 years ago

Describe the bug

Hi, I created a checkpoint file from an external library using TensorFlow 1.15, but when converting it I get:

Traceback (most recent call last):
  File "/home/sestini/miniconda3/envs/new_dcenv/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sestini/miniconda3/envs/new_dcenv/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sestini/miniconda3/envs/new_dcenv/lib/python3.6/site-packages/tf2onnx-1.6.0-py3.6.egg/tf2onnx/convert.py", line 161, in <module>
  File "/home/sestini/miniconda3/envs/new_dcenv/lib/python3.6/site-packages/tf2onnx-1.6.0-py3.6.egg/tf2onnx/convert.py", line 119, in main
  File "/home/sestini/miniconda3/envs/new_dcenv/lib/python3.6/site-packages/tf2onnx-1.6.0-py3.6.egg/tf2onnx/loader.py", line 76, in from_checkpoint
  File "/home/sestini/miniconda3/envs/new_dcenv/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1453, in import_meta_graph
    **kwargs)[0]
  File "/home/sestini/miniconda3/envs/new_dcenv/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1477, in _import_meta_graph_with_return_elements
    **kwargs))
  File "/home/sestini/miniconda3/envs/new_dcenv/lib/python3.6/site-packages/tensorflow_core/python/framework/meta_graph.py", line 891, in import_scoped_meta_graph_with_return_elements
    ops.prepend_name_scope(value, scope_to_prepend_to_names))
  File "/home/sestini/miniconda3/envs/new_dcenv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3607, in as_graph_element
    return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
  File "/home/sestini/miniconda3/envs/new_dcenv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3667, in _as_graph_element_locked
    "graph." % repr(name))
KeyError: "The name 'inner-optimizer.step/Adam' refers to an Operation not in the graph."

Is there a simple solution to this?

System information

jignparm commented 4 years ago

The name 'inner-optimizer.step/Adam' refers to an Operation not in the graph."

It seems like some training nodes crept into the model. Can you remove them? Also, if you are able to convert the checkpoint into a frozen model, that will ensure the graph is in a good state to be converted to ONNX.

graph = tf.graph_util.remove_training_nodes(graph)
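For context, a minimal sketch of how those two suggestions fit together in the TF 1.x API (the checkpoint prefix "model.ckpt" and the output node name "output_node" are placeholders; substitute your own):

```python
# Sketch (TF 1.15 API): restore a checkpoint, strip training nodes, and freeze.
# "model.ckpt" and "output_node" are placeholders for your checkpoint prefix
# and output op names.
import tensorflow as tf

saver = tf.train.import_meta_graph("model.ckpt.meta", clear_devices=True)
with tf.Session() as sess:
    saver.restore(sess, "model.ckpt")
    graph_def = tf.get_default_graph().as_graph_def()
    # Drop optimizer/training ops (e.g. .../Adam) before freezing
    graph_def = tf.graph_util.remove_training_nodes(graph_def)
    # Freezing fails loudly if any variable cannot become a constant,
    # which is a useful sanity check before running tf2onnx
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, graph_def, ["output_node"])
    with tf.gfile.GFile("frozen_model.pb", "wb") as f:
        f.write(frozen.SerializeToString())
```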
SestoAle commented 4 years ago

I've switched to TF 2.0 and now I get a different error message:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sestini/Scrivania/DeepCrawlTensorForce/python-rl/cancellare.py", line 137, in
    tf.import_graph_def(graphdef_inf, name='')
  File "/home/sestini/miniconda3/envs/new_dcenv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/sestini/miniconda3/envs/new_dcenv/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "/home/sestini/miniconda3/envs/new_dcenv/lib/python3.6/site-packages/tensorflow_core/python/framework/importer.py", line 505, in _import_graph_def_internal
    raise ValueError(str(e))
ValueError: Input 0 of node agent.act/ResourceScatterNdUpdate was passed float from agent/global_in-buffer:0 incompatible with expected resource.

Process finished with exit code 1

I've tried your solution of removing the training nodes, but I still get the same error.

I've also tried to freeze the model but I got this error instead:

  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2195, in __getitem__
    return self._inputs[i]
IndexError: list index out of range

guschmue commented 4 years ago

tf.graph_util.remove_training_nodes is done internally by tf2onnx. But based on your stack trace we don't get that far; it seems like TensorFlow itself has issues loading the checkpoint. The second stack trace looks similar. Is this a public model we can download?

For tf-2.x and tf-1.15 you might need changes to tf2onnx. There is a PR for this: https://github.com/onnx/tensorflow-onnx/pull/803

SestoAle commented 4 years ago

Here are the checkpoint files of my saved model: https://drive.google.com/file/d/18yd1BrODLFs26rXv3pRfZBAq0SEKd3_U/view?usp=sharing

I used TF2.0 for this.

As I said, I got this model from a reinforcement learning library based on TensorFlow, so I don't know exactly how the graph is created.

guschmue commented 4 years ago

thanks, I'm going to take a look.

SestoAle commented 4 years ago

Any luck?

guschmue commented 4 years ago

what are the input and output names for the model?

SestoAle commented 4 years ago

Sorry, I forgot to mention those.

The inputs should be: agent/global_in-input:0,agent/local_in-input:0,agent/local_in_two-input:0,agent/prev_action-input:0,agent/stats-input:0

while the output is: agent.act/action-output:0

Thanks for the help!
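For reference, with those input/output names a checkpoint conversion command would look roughly like this (a sketch; the checkpoint path "model.ckpt.meta" is a placeholder for the actual .meta file):

```shell
python -m tf2onnx.convert \
    --checkpoint model.ckpt.meta \
    --inputs agent/global_in-input:0,agent/local_in-input:0,agent/local_in_two-input:0,agent/prev_action-input:0,agent/stats-input:0 \
    --outputs agent.act/action-output:0 \
    --output model.onnx
```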

guschmue commented 4 years ago

thanks, I can reproduce it now. So we import the checkpoint, extract the inference graph, run the tf optimizer on it, and then load that into a session, with the last step failing. Need to think a little about how this could happen.

svetlanadataper commented 4 years ago

Hi,

I have the same problem converting my model from saved_model format to ONNX.

I am getting the exact same exception (except for the input names). Any idea what could be causing this?

Thank you

Svetlana

jignparm commented 4 years ago

@svetlanadataper, your error might be slightly different (since it's a saved_model instead of a checkpoint). Would you be able to share your model (the original model no longer seems to be available)? Please feel free to open a separate issue if needed/helpful.

svetlanadataper commented 4 years ago

Thank you for your reply. Unfortunately, it's a proprietary model that I cannot share. Do you have any clue as to why I could be getting this error?

jignparm commented 4 years ago

Do you have any clue as to why I could be getting this error?

Can you check whether the model can be converted to a frozen model successfully using the TensorFlow API? See the freeze_session() function in tf_loader.py, or the README.md file of this repo, for an example of how to do that. If the model cannot be frozen (i.e. some of the variables cannot be converted to constants), it may require modifying the model to remove the offending variables.
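Since your model is in saved_model format, the freezing check can be sketched with the TF 2.x API (the directory "saved_model_dir" and the signature key "serving_default" are the usual defaults, but substitute your own):

```python
# Sketch (TF 2.x): freeze a saved_model's serving function to verify that
# all variables can be converted to constants. "saved_model_dir" is a
# placeholder for your model directory.
import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import (
    convert_variables_to_constants_v2)

model = tf.saved_model.load("saved_model_dir")
func = model.signatures["serving_default"]
# This raises if some variable cannot be folded into the graph
frozen_func = convert_variables_to_constants_v2(func)
tf.io.write_graph(frozen_func.graph.as_graph_def(),
                  ".", "frozen_model.pb", as_text=False)
```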

If the above succeeds and you are able to generate a frozen_model.pb file, run the latest package of tf2onnx with --opset 12 to convert the frozen_model.pb. The error logs from this process should help figure out what's causing the conversion error.
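The frozen-graph conversion step would look roughly like this (a sketch; "input:0" and "output:0" are placeholders for your model's actual tensor names):

```shell
python -m tf2onnx.convert \
    --graphdef frozen_model.pb \
    --inputs input:0 \
    --outputs output:0 \
    --opset 12 \
    --output model.onnx
```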

svetlanadataper commented 4 years ago

Thank you so much, I will test this out and let you know!

guschmue commented 3 years ago

assume this is resolved.