ONNX Export Checkpoints

zalo commented 2 years ago

It would be nice if trained Isaac Gym models checkpointed to ONNX as well (for additional portability to game engines).

Here, I'll document the steps in my adventure so far in getting .onnx models out of the system:

Update the PyTorch version to 1.12. I did this by updating isaacgym's Dockerfile's base image to nvcr.io/nvidia/pytorch:22.04-py3. Thankfully, it just works. This is necessary because 1.10's ONNX Exporter can't handle the random normal-distribution operation. If ONNX Export were to be made standard, this would need to be propagated to the proper Isaac Gym preview: https://developer.nvidia.com/isaac-gym
Augment the run.sh command with another volume to allow for retrieval of the trained models (adding -v /home/gymuser/IsaacGymEnvs:/home/gymuser/IsaacGymEnvs is sufficient; we're going to pull this repo into that folder in the next step).

After starting the container, run:

git clone https://github.com/NVIDIA-Omniverse/IsaacGymEnvs.git ~/IsaacGymEnvs
pip install -q -e ~/IsaacGymEnvs

At this point, we want to make some changes to isaacgymenvs/learning/common_agent.py (to write .onnx models with each checkpoint). I did this by attaching a VS Code instance with the Docker Extension, but nano works as well. At the top, under the line from torch import optim, add:

from torch.onnx import utils as onnx_utils

Underneath where it says self.save(self.model_output_file + "_" + str(epoch_num)), add:

                    # Grab some dummy inputs for the onnx.export function
                    input_dict = self.dataset[len(self.dataset)-1]
                    input_dict['is_train'] = False
                    input_dict['prev_actions'] = input_dict['actions']
                    # Reduce the Batch Size down to 1 for Inference
                    for key in input_dict:
                        if key != "is_train": # Ignore the bool
                            input_dict[key] = input_dict[key][0:1]
                            print("Name: ", key, ", Shape: ", input_dict[key].shape)
                    # If the conversion is about to fail, print the unconvertible ops:
                    torch_script_graph, unconvertible_ops = onnx_utils.unconvertible_ops(
                        self.model, input_dict, opset_version = 11)
                    if len(unconvertible_ops) > 0:
                        print("Operations Incompatible with ONNX Export: ", unconvertible_ops)
                    else:
                        torch.onnx.export(self.model, input_dict, self.model_output_file + "_" + str(epoch_num)+".onnx",
                                        export_params       = True,  # Whether to store the trained parameter weights inside the model file
                                        opset_version       = 11,    # The ONNX opset version to target
                                        do_constant_folding = True,  # Whether to execute constant folding (hardcoding of constant nodes) for optimization
                                        input_names         = ['is_train', 'prev_actions', 'amp_obs_demo', 'amp_obs_replay', 'amp_obs', 'obs', 'old_values', 'old_logp_actions'], # Not sure these are correct
                                        output_names        = ['advantages', 'returns', 'actions', 'mu', 'sigma' ], # Likewise, I'm not sure if these are correct...
                                        verbose             = False) # If True, print a model summary to the console

It would be nice if one of the authors could check my work here; I'm not sure if I have the names of the input and output tensors correct... @gavrielstate
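
The batch-size reduction in the snippet above can be factored into a small helper. This is only a sketch: single_sample_inputs is a hypothetical name, and nested lists stand in for tensors so the block runs without PyTorch. Slicing with [0:1] behaves the same way on a tensor, keeping the leading batch dimension at size 1.

```python
def single_sample_inputs(batch, skip_keys=("is_train",)):
    """Return a new dict in which every sliceable entry is cut to its first sample.

    Entries named in `skip_keys` (e.g. the boolean `is_train` flag) are copied
    through unchanged. The source dict is not modified.
    """
    reduced = {}
    for key, value in batch.items():
        if key in skip_keys:
            reduced[key] = value
        else:
            # [0:1] (rather than [0]) keeps the batch dimension, now of size 1.
            reduced[key] = value[0:1]
    return reduced


# Toy batch of two samples; lists stand in for tensors here.
batch = {
    "obs": [[0.1, 0.2], [0.3, 0.4]],
    "actions": [[1.0], [2.0]],
    "is_train": False,
}
dummy = single_sample_inputs(batch)
```

Building a new dict instead of overwriting entries in place leaves the original batch available for printing or comparison afterwards.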

After training for >50 epochs (via python train.py task=HumanoidAMP or somesuch), you should see .onnx checkpoints dumped alongside your PyTorch .pth checkpoints. I've attached an example HumanoidAMP checkpoint (HumanoidAMP_3455_ONNX.zip), which can be inspected in https://netron.app/ (and hopefully run in Unity's Barracuda Evaluator; I haven't tested it yet).

There's a strong chance I'm not properly accounting for inputs, persistent state, or the AMP actor critic... but I'm hoping my explorations here help lay the groundwork for more comprehensive ONNX support and portability across the Isaac Gym Ecosystem.

Thank you for your consideration.

annan-tang commented 2 years ago

Hi, I am a little confused about your decision to upgrade to 1.12. From pytorch/issues/30517, we know that 1.10 already supports the random normal-distribution operation. Did you test that? I'd be very glad to get your feedback.

zalo commented 2 years ago

Hi Annan; it’s been a while since I had this set up. I’d have to run the steps again to see the exact error message, but I recall it being pernicious until upgrading the PyTorch version, and relating to the random normal distribution function.

Perhaps you can replicate the steps with the current version and see if it still happens?

annan-tang commented 2 years ago

Hi, I guess it is because torch.onnx.export() uses opset_version=9 by default in PyTorch 1.10. If you set opset_version=11 manually, everything works. I'm currently on PyTorch 1.10 and it works well; I tested it today.
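
One way to avoid depending on a version-specific default is to collect the export options in one place with the opset pinned. The sketch below is an assumption-laden illustration: onnx_export_kwargs is a hypothetical helper, and the keyword names are the ones used in the torch.onnx.export call earlier in this thread. It does not import PyTorch, so the final export call is shown only as a comment.

```python
def onnx_export_kwargs(opset_version=11, input_names=None, output_names=None):
    """Build keyword arguments for torch.onnx.export with an explicit opset."""
    if opset_version < 1:
        raise ValueError("opset_version must be a positive integer")
    kwargs = {
        "export_params": True,           # store trained weights in the file
        "opset_version": opset_version,  # never rely on the library default
        "do_constant_folding": True,
        "verbose": False,
    }
    # Only pass names when the caller supplied them.
    if input_names is not None:
        kwargs["input_names"] = list(input_names)
    if output_names is not None:
        kwargs["output_names"] = list(output_names)
    return kwargs


kwargs = onnx_export_kwargs(input_names=["obs"], output_names=["mu"])
# With PyTorch installed, these would be passed along as:
#   torch.onnx.export(model, args, path, **kwargs)
```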

Back to the original topic: I think it would be better to export ONNX only after checking that the policy is mature, and to export only the self.player.model.a2cnetwork and self.player.model.running_mean_std parts. These two parts are enough for inference.
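
To make the "two parts are enough" point concrete, here is a rough plain-Python sketch of what inference then amounts to: normalize the observation with the stored running statistics, then feed it to the network. Everything here is illustrative. normalize_obs, policy_step, and toy_network are hypothetical names, and the epsilon of 1e-5 and clipping range of ±5 are assumptions about the normalizer's defaults that should be checked against the actual running_mean_std module.

```python
import math


def normalize_obs(obs, mean, var, epsilon=1e-5, clip=5.0):
    """Standardize each observation entry and clip it to [-clip, clip].

    `epsilon` and `clip` are assumed defaults, not values taken from the thread.
    """
    normalized = []
    for x, m, v in zip(obs, mean, var):
        y = (x - m) / math.sqrt(v + epsilon)
        normalized.append(max(-clip, min(clip, y)))
    return normalized


def policy_step(obs, mean, var, network):
    """Inference path: running-statistics normalization, then the network."""
    return network(normalize_obs(obs, mean, var))


def toy_network(x):
    # Hypothetical stand-in for the trained actor; it just sums its inputs.
    return [sum(x)]


normalized = normalize_obs([3.0, 100.0], mean=[1.0, 0.0], var=[4.0, 1.0])
action = policy_step([3.0, 100.0], mean=[1.0, 0.0], var=[4.0, 1.0],
                     network=toy_network)
```

In a real export, the same two stages would be wrapped in a single torch.nn.Module so that the ONNX graph takes raw observations as input.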

Denys88 commented 2 years ago

Hi, I created these examples of how to export. I wasn't able to make it work with torch distributions, so I created a simple wrapper that calls the normalization. You can take a look here: https://github.com/Denys88/rl_games#quickstart-colab-in-the-cloud (all links work in Google Colab)

gemaRincon commented 2 years ago

I am trying to do the same thing as you: convert the .pth from the Isaac Gym examples to .onnx. After copying your script I got the error below. Could you tell me what I am doing wrong that keeps the conversion function from working?

I don't know if the image will be displayed, so here is the error text in case you can tell me what I'm doing wrong: TypeError: forward() missing 1 required positional argument: 'input_dict'.


zalo commented 2 years ago

Sorry friend, I haven't seen that error before.
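
One plausible cause of that TypeError, offered as a guess rather than a diagnosis: the torch.onnx.export documentation describes a dict in the last position of args as keyword arguments for forward(), and suggests appending an empty dict when a dict must be passed positionally, i.e. args=(input_dict, {}). The plain-Python sketch below only imitates that behavior. split_export_args is a hypothetical, simplified model and not the library's code, and the exact handling may differ between PyTorch versions.

```python
import inspect


def split_export_args(forward, args):
    """Simplified model: a trailing dict is matched against parameter names."""
    if not isinstance(args, tuple):
        args = (args,)
    if args and isinstance(args[-1], dict):
        named = args[-1]
        positional = list(args[:-1])
        param_names = list(inspect.signature(forward).parameters)
        # Fill remaining parameters from the dict, by parameter name.
        for name in param_names[len(positional):]:
            if name in named:
                positional.append(named[name])
        return tuple(positional)
    return args


def forward(input_dict):
    # Stand-in for a model whose forward() takes one dict argument.
    return sorted(input_dict)


batch = {"obs": [0.0], "is_train": False}

# The dict's keys ("obs", "is_train") do not match the parameter name
# "input_dict", so forward() ends up being called with no arguments.
try:
    forward(*split_export_args(forward, batch))
    bare_dict_error = None
except TypeError as exc:
    bare_dict_error = str(exc)

# With an empty dict appended, the batch dict is passed positionally.
result = forward(*split_export_args(forward, (batch, {})))
```

If this is what is happening, changing the export call to pass (input_dict, {}) instead of input_dict would be the first thing to try.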

isaac-sim / IsaacGymEnvs

ONNX Export Checkpoints #43