swook / GazeML

Gaze Estimation using Deep Learning, a Tensorflow-based framework.
MIT License

[TUTORIAL] How to save ELG model to .onnx and further to TensorRT .engine #90

Open Hyrtsi opened 2 years ago

Hyrtsi commented 2 years ago

Hi all. I have spent quite some time reading and using this awesome code. Converting the model to .onnx and .engine wasn't easy, so I'm sharing how I did it.

Installation

Ok, time to install everything!

pip install --upgrade pip
pip install cython
pip install scipy
python3 setup.py install
pip install tensorflow==1.14

For me tensorflow==1.15 didn't work. You can also install tensorflow-gpu; make sure it's the same version, or check the support matrix on the TensorFlow page. Note that most of the TF 1.x API is deprecated, so it's hard to get support for it. For that reason I'm thinking of reimplementing this whole repo in PyTorch or TF2.

If python3 setup.py install hangs, just install the dependencies by hand, one by one.

Get the pre-trained weights: bash get_trained_weights.bash


Running the model

Before converting anything test the model:

cd src
python3 elg_demo.py

I got a ton of errors but the model worked nonetheless.

Saving the model as .onnx

Use this tool: tf2onnx

pip install -U tf2onnx

Then, we have to modify the code a bit before we can get started.

Save the saved-model in inference_generator()

Add these code lines before line 385 yield outputs:

            # Save saved-model
            tf.saved_model.simple_save(self._tensorflow_session, "tmp",
                 inputs=data_source.output_tensors, outputs=fetches)
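A small guard may be worth adding around that call. This is only a sketch under two assumptions that should be checked against your copy of the code: that the save call ends up inside the loop that yields outputs (so it is reached once per batch), and that simple_save refuses to write into an export directory that already contains files. The helper name export_dir_is_free is hypothetical and not part of GazeML or TensorFlow.

```python
import os


def export_dir_is_free(export_dir):
    """Return True if export_dir does not exist yet or is an empty directory."""
    if not os.path.exists(export_dir):
        return True
    return os.path.isdir(export_dir) and not os.listdir(export_dir)


# Hypothetical use inside inference_generator(), right before `yield outputs`:
#
#     if export_dir_is_free("tmp"):
#         tf.saved_model.simple_save(self._tensorflow_session, "tmp",
#             inputs=data_source.output_tensors, outputs=fetches)
```

With a guard like this the model is exported on the first pass only; delete the tmp folder by hand when a fresh export is needed.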

When you run this code again (python3 elg_demo.py) it will create a folder tmp with saved_model.pb in it. But don't run it yet, because if you try to convert the resulting model you will get this error:

ValueError: Input 0 of node hourglass/pre/BatchNorm/cond_1/AssignMovingAvg/Switch was passed float from hourglass/pre/BatchNorm/moving_mean:0 incompatible with expected float_ref.

The error is actually quite helpful: it tells us where in the graph the problem is. BatchNorm is causing it. There are many answers on Google about this issue, but I think the easiest fix is to set training to False, since BatchNorm behaves differently during training and during testing. Change at least these lines:

and optionally:

This is a bug in the code: self.use_batch_statistics is set to True everywhere but it isn't set to False at any point. I could create a PR for this.
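To illustrate why the training flag matters, here is a framework-free sketch of batch normalization on a plain list of numbers. It is not the GazeML or TensorFlow implementation; the function name, the state dictionary and the momentum and eps defaults are assumptions chosen for the example. The point is that training mode writes to the moving averages, while inference mode only reads them. In the TF1 graph those writes presumably correspond to the AssignMovingAvg nodes named in the error above.

```python
def batch_norm(x, state, training, momentum=0.9, eps=1e-5):
    """Normalize a list of floats; `state` holds moving_mean and moving_var."""
    if training:
        # Training mode: use the statistics of the current batch ...
        mean = sum(x) / len(x)
        var = sum((v - mean) ** 2 for v in x) / len(x)
        # ... and update the moving averages (a write to the variables).
        state["moving_mean"] = momentum * state["moving_mean"] + (1 - momentum) * mean
        state["moving_var"] = momentum * state["moving_var"] + (1 - momentum) * var
    else:
        # Inference mode: only read the stored statistics, never modify them.
        mean, var = state["moving_mean"], state["moving_var"]
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


state = {"moving_mean": 0.0, "moving_var": 1.0}
batch_norm([1.0, 2.0, 3.0], state, training=False)  # state is left untouched
batch_norm([1.0, 2.0, 3.0], state, training=True)   # state is updated
```

A graph exported in inference mode has no such update step, so there is nothing left that needs a writable variable.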

Now we have done all the changes.

You can convert that file to .onnx like so:

python3 -m tf2onnx.convert --saved-model ./tmp --output gazeml.onnx

For most needs that should be enough. You can add --opset <opset> (for example --opset 10) if you want to target a specific opset. You can also add --target tensorrt or similar. Check the tf2onnx repo for more flags if you need them.
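If you convert often, the command can be assembled from a script. The sketch below only builds the argument list from the flags mentioned above (--saved-model, --output, --opset, --target); the function name tf2onnx_command is hypothetical, and nothing is executed unless you pass the list to subprocess yourself.

```python
import shlex


def tf2onnx_command(saved_model_dir, output_path, opset=None, target=None):
    """Build the tf2onnx command line as a list of arguments."""
    cmd = ["python3", "-m", "tf2onnx.convert",
           "--saved-model", saved_model_dir,
           "--output", output_path]
    if opset is not None:
        cmd += ["--opset", str(opset)]
    if target is not None:
        cmd += ["--target", target]
    return cmd


cmd = tf2onnx_command("./tmp", "gazeml.onnx", opset=10)
# Show the command in a form that can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in cmd))
# To actually run the conversion: subprocess.run(cmd, check=True)
```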

There's one more thing you should know.

Converting the model to TensorRT .engine

If you try to convert the model with trtexec or a similar tool, you'll run into a small problem: the model contains uint8, which TensorRT doesn't support. You must remove the uint8's in the model like this:

Here, change uint8 to int64 and it will work.
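To confirm whether an exported model still contains uint8, the .onnx file can be scanned with the onnx Python package. This is a rough sketch and not the change described above: it only reports where uint8 appears (graph inputs, outputs, intermediate value_info entries, initializers, and Cast nodes whose target type is uint8). The names find_uint8 and report are hypothetical; the value 2 is the ONNX TensorProto code for UINT8.

```python
UINT8 = 2  # same value as onnx.TensorProto.UINT8


def find_uint8(graph):
    """Return (kind, name) pairs for every uint8 entry found in an ONNX graph."""
    hits = []
    for kind in ("input", "output", "value_info"):
        for info in getattr(graph, kind):
            if info.type.tensor_type.elem_type == UINT8:
                hits.append((kind, info.name))
    for init in graph.initializer:
        if init.data_type == UINT8:
            hits.append(("initializer", init.name))
    for node in graph.node:
        if node.op_type == "Cast":
            for attr in node.attribute:
                # The "to" attribute of Cast holds the target element type.
                if attr.name == "to" and attr.i == UINT8:
                    hits.append(("cast", node.name))
    return hits


def report(path="gazeml.onnx"):
    """Print every uint8 entry in the model at `path` (needs `pip install onnx`)."""
    import onnx
    for kind, name in find_uint8(onnx.load(path).graph):
        print(kind, name)
```

An empty result after re-exporting suggests that no uint8 tensors are left in the graph.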

Then you can convert:

trtexec --onnx gazeml.onnx --saveEngine gazeml.engine --buildOnly --verbose --best

or using onnx2trt:

onnx2trt gazeml.onnx -o gazeml.engine

That should be it. Thank you! I hope my weeks of grinding help someone. Please ask if you have any questions.

Hyrtsi commented 2 years ago

You can check the .onnx model graph by using this awesome tool: https://netron.app/