quic / aimet

AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
https://quic.github.io/aimet-pages/index.html

aimet_torch 1.28 cannot export my ONNX model normally #2475

Open zhuoran-guo opened 1 year ago

zhuoran-guo commented 1 year ago

I can export the ONNX model after QuantizationSimModel in aimet_torch 1.27, but I cannot export it in aimet_torch 1.28. This is how I export the model in 1.28:

    quant_sim = QuantizationSimModel(
        prepared_model,
        dummy_input=dummy_input,
        quant_scheme=QuantScheme.training_range_learning_with_tf_init,
        config_file=config_file_path,
    )

    quant_sim.compute_encodings(
        forward_pass_callback=pass_calibration_data, forward_pass_callback_args=use_cuda,
    )

    onnx_export_args = {
        'opset_version': 11,
        'verbose': True,
        'input_names': [f"input{i}" for i in range(5)],
        'output_names':[f"output{i}" for i in range(5)],
        'dynamic_axes': None,
        'keep_initializers_as_inputs': True,
    }

    quant_sim.export(new_dir_path, "aimet_e2e_ptq", dummy_input=get_dummy_input_a(model, 128, False), onnx_export_args=onnx_export_args)

From the error log, it seems that the default PyTorch version for AIMET 1.28 is 1.13, which only supports export with ONNX opset 14. However, https://github.com/quic/aimet/releases says that AIMET 1.28 is still compatible with PyTorch 1.9. So I'd like to know: is it possible to export a model with ONNX opset 11 in an AIMET 1.28 environment? I need AIMET 1.28 for further Quantization-Aware Training (QAT) on my quantized models.

Thanks for your help!

Traceback (most recent call last):
  File "incam_e2e_aimet.py", line 318, in <module>
    main()
  File "incam_e2e_aimet.py", line 303, in main
    quant_sim.export(new_dir_path, "aimet_e2e_ptq", dummy_input=get_dummy_input_a(model, 128, False), onnx_export_args=onnx_export_args)
  File "/usr/local/lib/python3.8/dist-packages/aimet_torch/quantsim.py", line 431, in export
    self.export_onnx_model_and_encodings(path, filename_prefix, model_to_export, self.model,
  File "/usr/local/lib/python3.8/dist-packages/aimet_torch/quantsim.py", line 504, in export_onnx_model_and_encodings
    OnnxSaver.create_onnx_model_with_pytorch_layer_names(onnx_path, original_model, dummy_input, is_conditional,
  File "/usr/local/lib/python3.8/dist-packages/aimet_torch/onnx_utils.py", line 378, in create_onnx_model_with_pytorch_layer_names
    cls.set_node_names(onnx_model_path, pytorch_model, dummy_input, is_conditional, module_marker_map, onnx_export_args)
  File "/usr/local/lib/python3.8/dist-packages/aimet_torch/onnx_utils.py", line 400, in set_node_names
    onnx_model = cls._map_onnx_nodes_to_pytorch_modules(pytorch_model, dummy_input,
  File "/usr/local/lib/python3.8/dist-packages/aimet_torch/onnx_utils.py", line 522, in _map_onnx_nodes_to_pytorch_modules
    onnx_model, onnx_model_all_marker = cls._create_onnx_model(dummy_input, is_conditional, module_marker_map,
  File "/usr/local/lib/python3.8/dist-packages/aimet_torch/onnx_utils.py", line 577, in _create_onnx_model
    onnx_model = cls._create_onnx_model_with_markers(dummy_input, pt_model, working_dir, onnx_export_args,
  File "/usr/local/lib/python3.8/dist-packages/aimet_torch/onnx_utils.py", line 1066, in _create_onnx_model_with_markers
    cls._export_model_to_onnx(model, dummy_input, temp_file, is_conditional, onnx_export_args)
  File "/usr/local/lib/python3.8/dist-packages/aimet_torch/onnx_utils.py", line 1456, in _export_model_to_onnx
    torch.onnx.export(model, dummy_input, temp_file, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/onnx/utils.py", line 504, in export
    _export(
  File "/usr/local/lib/python3.8/dist-packages/torch/onnx/utils.py", line 1529, in _export
    graph, params_dict, torch_out = _model_to_graph(
  File "/usr/local/lib/python3.8/dist-packages/torch/onnx/utils.py", line 1115, in _model_to_graph
    graph = _optimize_graph(
  File "/usr/local/lib/python3.8/dist-packages/torch/onnx/utils.py", line 663, in _optimize_graph
    graph = _C._jit_pass_onnx(graph, operator_export_type)
  File "/usr/local/lib/python3.8/dist-packages/torch/onnx/utils.py", line 1899, in _run_symbolic_function
    return symbolic_fn(graph_context, *inputs, **attrs)
  File "/usr/local/lib/python3.8/dist-packages/torch/onnx/symbolic_helper.py", line 303, in wrapper
    return fn(g, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/onnx/symbolic_opset9.py", line 1040, in unsafe_chunk
    return symbolic_helper._unimplemented(
  File "/usr/local/lib/python3.8/dist-packages/torch/onnx/symbolic_helper.py", line 577, in _unimplemented
    _onnx_unsupported(f"{op}, {msg}", value)
  File "/usr/local/lib/python3.8/dist-packages/torch/onnx/symbolic_helper.py", line 588, in _onnx_unsupported
    raise errors.SymbolicValueError(
torch.onnx.errors.SymbolicValueError: Unsupported: ONNX export of operator unsafe_chunk, unknown dimension size. Please feel free to request support or submit a pull request on PyTorch GitHub: https://github.com/pytorch/pytorch/issues  [Caused by the value '628 defined in (%628 : Float(*, *, strides=[400, 1], requires_grad=1, device=cpu) = onnx::Add(%623, %627), scope: torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl::/aimet_torch.onnx_utils.CustomMarker::lstm1/torch.nn.modules.rnn.LSTMCell::marked_module # /usr/local/lib/python3.8/dist-packages/torch/nn/modules/rnn.py:1194:0
)' (type 'Tensor') in the TorchScript graph. The containing node has kind 'onnx::Add'.] 
    (node defined in /usr/local/lib/python3.8/dist-packages/torch/nn/modules/rnn.py(1194): forward
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(1182): _slow_forward
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(1194): _call_impl
/usr/local/lib/python3.8/dist-packages/aimet_torch/onnx_utils.py(272): forward
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(1182): _slow_forward
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(1194): _call_impl
<eval_with_key>.5(167): forward
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(1182): _slow_forward
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(1194): _call_impl
/usr/local/lib/python3.8/dist-packages/torch/fx/graph_module.py(267): __call__
/usr/local/lib/python3.8/dist-packages/torch/fx/graph_module.py(658): call_wrapped
/usr/local/lib/python3.8/dist-packages/torch/jit/_trace.py(118): wrapper
/usr/local/lib/python3.8/dist-packages/torch/jit/_trace.py(127): forward
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(1194): _call_impl
/usr/local/lib/python3.8/dist-packages/torch/jit/_trace.py(1184): _get_trace_graph
/usr/local/lib/python3.8/dist-packages/torch/onnx/utils.py(891): _trace_and_get_graph_from_model
/usr/local/lib/python3.8/dist-packages/torch/onnx/utils.py(987): _create_jit_graph
/usr/local/lib/python3.8/dist-packages/torch/onnx/utils.py(1111): _model_to_graph
/usr/local/lib/python3.8/dist-packages/torch/onnx/utils.py(1529): _export
/usr/local/lib/python3.8/dist-packages/torch/onnx/utils.py(504): export
/usr/local/lib/python3.8/dist-packages/aimet_torch/onnx_utils.py(1456): _export_model_to_onnx
/usr/local/lib/python3.8/dist-packages/aimet_torch/onnx_utils.py(1066): _create_onnx_model_with_markers
/usr/local/lib/python3.8/dist-packages/aimet_torch/onnx_utils.py(577): _create_onnx_model
/usr/local/lib/python3.8/dist-packages/aimet_torch/onnx_utils.py(522): _map_onnx_nodes_to_pytorch_modules
/usr/local/lib/python3.8/dist-packages/aimet_torch/onnx_utils.py(400): set_node_names
/usr/local/lib/python3.8/dist-packages/aimet_torch/onnx_utils.py(378): create_onnx_model_with_pytorch_layer_names
/usr/local/lib/python3.8/dist-packages/aimet_torch/quantsim.py(504): export_onnx_model_and_encodings
/usr/local/lib/python3.8/dist-packages/aimet_torch/quantsim.py(431): export
incam_e2e_aimet.py(303): main
incam_e2e_aimet.py(318): <module>
)

    Inputs:
        #0: 623 defined in (%623 : Float(1, 400, strides=[400, 1], requires_grad=1, device=cpu) = onnx::Gemm[alpha=1., beta=1.](%1, %620, %inner_model.marked_module.decoder.marked_module.lstm1.marked_module.bias_hh), scope: torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl::/aimet_torch.onnx_utils.CustomMarker::lstm1/torch.nn.modules.rnn.LSTMCell::marked_module # /usr/local/lib/python3.8/dist-packages/torch/nn/modules/rnn.py:1194:0
    )  (type 'Tensor')
        #1: 627 defined in (%627 : Float(*, *, strides=[400, 1], requires_grad=1, device=cpu) = onnx::Gemm[alpha=1., beta=1.](%619, %624, %inner_model.marked_module.decoder.marked_module.lstm1.marked_module.bias_ih), scope: torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl::/aimet_torch.onnx_utils.CustomMarker::lstm1/torch.nn.modules.rnn.LSTMCell::marked_module # /usr/local/lib/python3.8/dist-packages/torch/nn/modules/rnn.py:1194:0
    )  (type 'Tensor')
    Outputs:
        #0: 628 defined in (%628 : Float(*, *, strides=[400, 1], requires_grad=1, device=cpu) = onnx::Add(%623, %627), scope: torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl::/aimet_torch.onnx_utils.CustomMarker::lstm1/torch.nn.modules.rnn.LSTMCell::marked_module # /usr/local/lib/python3.8/dist-packages/torch/nn/modules/rnn.py:1194:0
    )  (type 'Tensor')

I can still export the ONNX model in aimet_torch 1.27 because its torch version is 1.9.1+cu111, whereas aimet_torch 1.28 uses 1.13.1+cu116.

zhuoran-guo commented 1 year ago

If I install torch 1.9.1+cu111 in the aimet 1.28 environment, I get this error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aimettorch torch-gpu-1.28.0 requires torch==1.13.1+cu116, but you have torch 1.9.1+cu111 which is incompatible.
aimettorch torch-gpu-1.28.0 requires torchvision==0.14.1+cu116, but you have torchvision 0.10.1+cu111 which is incompatible.

quic-mangal commented 1 year ago

Can you try with opset 11 or 12?

zhuoran-guo commented 1 year ago

@quic-mangal Hello, thanks for your response. Do you mean setting opset_version to 11 or 12 via onnx_export_args, like below? I have already tried this, but it does not work yet.

onnx_export_args = {
    'opset_version': 11,
    'verbose': True,
    'input_names': [f"input{i}" for i in range(5)],
    'output_names': [f"output{i}" for i in range(5)],
    'dynamic_axes': None,
    'keep_initializers_as_inputs': True,
}

quant_sim.export(new_dir_path, "aimet", dummy_input, onnx_export_args=onnx_export_args)

quic-mangal commented 1 year ago

Are you able to export this model from torch to ONNX without AIMET being in the middle? Because the error says:

torch.onnx.errors.SymbolicValueError: Unsupported: ONNX export of operator unsafe_chunk, unknown dimension size. Please feel free to request support or submit a pull request on PyTorch GitHub: https://github.com/pytorch/pytorch/issues  [Caused by the value '628 defined in (%628 : Float(*, *, strides=[400, 1], requires_grad=1, device=cpu) = onnx::Add(%623, %627), scope: torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl::/aimet_torch.onnx_utils.CustomMarker::lstm1/torch.nn.modules.rnn.LSTMCell::marked_module # /usr/local/lib/python3.8/dist-packages/torch/nn/modules/rnn.py:1194:0
)' (type 'Tensor') in the TorchScript graph. The containing node has kind 'onnx::Add'.] 
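
For example, you could try exporting directly like this (a sketch, assuming the model and the get_dummy_input_a helper from your script):

import torch

# Export the unmodified model directly, bypassing AIMET's marker machinery,
# to check whether unsafe_chunk also fails in plain torch.onnx.export.
model.eval()
torch.onnx.export(
    model,
    get_dummy_input_a(model, 128, False),
    "plain_torch_export.onnx",
    opset_version=11,
)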

zhuoran-guo commented 1 year ago

@quic-mangal Thanks for your response. Yes, I can export this model from torch to ONNX directly without AIMET. For example, this call in the script works fine: pytorch2onnx(model, input_size=128, output_file='/work/incam-qat/model/test.onnx'). But this call cannot save the ONNX model successfully: quant_sim.export(new_dir_path, "aimet_e2e_ptq", dummy_input=get_dummy_input_a(model, 128, False), onnx_export_args=onnx_export_args)

zhuoran-guo commented 1 year ago

@quic-mangal But if I comment out this line, the ONNX model is saved successfully, so the problem seems to happen here: https://github.com/quic/aimet/blob/ae983476d09e863f9973a586f32fa5acd2c5217e/TrainingExtensions/torch/src/python/aimet_torch/onnx_utils.py#L1061 I want to convert the ONNX model to a DLC model and deploy it on SNPE later. Is it feasible to ignore this line of code?

snpe-onnx-to-dlc -i aimet.onnx --quantization_overrides aimet.encodings -o aimet.dlc
snpe-dlc-quantize --input_dlc aimet.dlc --input_list ../input_list.txt --output_dlc  aimet_quantization.dlc --override_params

quic-mangal commented 1 year ago

@quic-akinlawo, could you take up this last question. Thanks

quic-akinlawo commented 1 year ago

@zhuoran-guo Commenting out that line may be a non-issue depending on your model (i.e. it may work fine). In certain cases, such as a one-to-many pytorch to onnx op mapping, you could see mismatched encoding values between SNPE and AIMET if markers are not added correctly.
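
For example, a module with no single ONNX equivalent at a given opset illustrates the one-to-many case (a sketch, assuming the onnx package is installed; at opset 11 nn.Hardswish is expected to decompose into something like HardSigmoid + Mul):

import io

import onnx
import torch

# Export a single Hardswish module at opset 11 and list the resulting ONNX
# node types; more than one node for one module is a one-to-many mapping.
buffer = io.BytesIO()
torch.onnx.export(torch.nn.Hardswish(), torch.randn(1, 4), buffer, opset_version=11)
print([node.op_type for node in onnx.load_from_string(buffer.getvalue()).graph.node])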

zhuoran-guo commented 1 year ago

@quic-akinlawo @quic-mangal Hi, thanks for your responses. I checked the behavior: if I simply comment out this line, the ONNX model can be exported successfully, but there is a mismatch between the ONNX layer names and the PyTorch ones. As a result, the following warning is raised in the subsequent steps, and because of this warning the generated encodings file is also empty:

The following layers were not found in the exported onnx model. Encodings for these layers will not appear in the exported encodings file:
['encoder.model.conv_stem', 'encoder.model.act1', .......]

It seems layers_to_onnx_op_names ends up empty because I commented out that line. So, based on the warning message, I customized the value of layers_to_onnx_op_names, e.g. {'encoder.model.conv_stem': ['encoder.model.conv_stem'], 'encoder.model.act1': ['encoder.model.act1'], ......}, and then the ONNX model exports successfully and the encodings file is generated with parameters.
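
A sketch of that identity mapping (illustrative only; my real dict was hand-built from the warning output, and this assumes a strict one-to-one PyTorch-to-ONNX correspondence):

layers_to_onnx_op_names = {
    # Map every leaf module name to itself; a one-to-many op decomposition
    # would break this assumption.
    name: [name]
    for name, module in quant_sim.model.named_modules()
    if len(list(module.children())) == 0
}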

However, I do not have an in-depth understanding of the source code, so I am not sure whether this approach is correct or whether it may introduce errors. Additionally, regarding the earlier export error, can I share a minimal script with your team to reproduce the issue and analyze how to resolve this bug? Thank you. (It seems to happen because of the LSTM cell.)

quic-hitameht commented 1 year ago

@zhuoran-guo Please share minimal script to reproduce the issue.

zhuoran-guo commented 1 year ago

@quic-hitameht Thank you. You can run this script to reproduce the issue; it seems to happen because of the LSTMCell part. The environment is aimet_torch_gpu 1.28 with torch 1.13.1+cu116. If you have any problems, please tell me.

import os
import torch
import torch.cuda
import torch.nn as nn
from aimet_common.defs import QuantScheme
from aimet_torch.model_preparer import prepare_model
from aimet_torch.quantsim import QuantizationSimModel, OnnxExportApiArgs

from aimet_torch.batch_norm_fold import fold_all_batch_norms
import timm

class ModelA(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = ConvEncoder()
        self.decoder = LSTMDecoder(
            self.encoder.model.num_features,
        )

    def forward(self, x, h_prev, c_prev):
        x = self.encoder(x)
        return self.decoder(x, h_prev, c_prev)

class LSTMDecoder(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.hidden_dim = 100
        self.lstm = nn.LSTMCell(input_dim, 100)

    def init_states(self, batch_size: int, device: torch.device):
        h_next = [torch.zeros(batch_size, self.hidden_dim).to(device) for _ in range(1)]
        c_next = [torch.zeros(batch_size, self.hidden_dim).to(device) for _ in range(1)]
        return h_next, c_next

    def forward(self, x, h, c):
        h, c = self.lstm(x, (h[0], c[0]))
        return h, c

class ConvEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        name = "efficientnet_lite0"
        assert name in timm.list_models(pretrained=False), name
        self.model = timm.create_model(name, pretrained=False, num_classes=0)

    def forward(self, x):
        return self.model(x)

def get_dummy_input_a(model: nn.Module, input_size: int, use_cuda: bool):
    # (B, C, H, W)
    if use_cuda:
        device = torch.device('cuda')
        input_shape = (1, 3, input_size, input_size)
        data = torch.randn(input_shape, requires_grad=False).to(device)
        h, c = model.decoder.init_states(1, "cuda")
    else:
        device = torch.device('cpu')
        input_shape = (1, 3, input_size, input_size)
        data = torch.randn(input_shape, requires_grad=False).to(device)
        h, c = model.decoder.init_states(1, "cpu")
    return data, h, c

def main():
    model = ModelA()
    prepared_model = prepare_model(model)
    use_cuda = False
    if torch.cuda.is_available():
        use_cuda = True
        prepared_model.to(torch.device('cuda'))

    dummy_input = get_dummy_input_a(model, 128, use_cuda)
    _ = fold_all_batch_norms(prepared_model, input_shapes=None, dummy_input=get_dummy_input_a(model, 128, use_cuda))

    quant_sim = QuantizationSimModel(
        prepared_model,
        dummy_input=dummy_input,
        quant_scheme=QuantScheme.training_range_learning_with_tf_init,
    )

    new_dir_path = '/work/incam-qat/model_test'

    os.makedirs(new_dir_path, exist_ok=True)

    onnx_export_args = {
        'opset_version': 11,
        'verbose': True,
        'input_names': [f"input{i}" for i in range(3)],
        'output_names':[f"output{i}" for i in range(3)],
        'dynamic_axes': None,
        'keep_initializers_as_inputs': True,
    }

    quant_sim.export(new_dir_path, "aimet_e2e_ptq", dummy_input=get_dummy_input_a(model, 128, False), onnx_export_args=onnx_export_args)

if __name__ == "__main__":
    main()

quic-hitameht commented 1 year ago

@zhuoran-guo Thanks for sharing the script. We'll take a detailed look.

Meanwhile, have you tried using torch.nn.LSTM instead of torch.nn.LSTMCell, as suggested by this issue from the ONNX repository: https://github.com/onnx/onnx/issues/3597
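
A single-step nn.LSTM wrapper along those lines might look like this (a sketch only; the trained LSTMCell weights weight_ih/weight_hh/bias_ih/bias_hh would need to be copied onto the matching nn.LSTM parameters such as weight_ih_l0):

import torch
import torch.nn as nn

class SingleStepLSTM(nn.Module):
    """Single-step nn.LSTM that mimics the nn.LSTMCell interface."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=1)

    def forward(self, x, h, c):
        # nn.LSTM expects (seq_len, batch, features) inputs and
        # (num_layers, batch, hidden) states, so add/remove a leading dim.
        _, (h_next, c_next) = self.lstm(x.unsqueeze(0), (h.unsqueeze(0), c.unsqueeze(0)))
        return h_next.squeeze(0), c_next.squeeze(0)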

zhuoran-guo commented 1 year ago

@quic-hitameht Yes, if I use torch.nn.LSTM instead of torch.nn.LSTMCell, I can export the ONNX model without errors. However, my model was trained with torch.nn.LSTMCell, and I need to perform Quantization-Aware Training (QAT) using Aimet. Therefore, if it is possible, I would like Aimet to support the export of models using LSTMCell.

As I discussed with @quic-mangal previously, the model can be exported successfully without Aimet. It appears that there is a bug occurring during the Aimet export process at this point: if I simply comment out this line, Aimet is able to export the ONNX model successfully. https://github.com/quic/aimet/blob/ae983476d09e863f9973a586f32fa5acd2c5217e/TrainingExtensions/torch/src/python/aimet_torch/onnx_utils.py#L1061

quic-bharathr commented 1 year ago

Hi @zhuoran-guo While the source code remains compatible with PyTorch 1.9, the released wheel packages are not. We will need to rebuild the package for PyTorch 1.9; we will make these packages available for the 1.28 OS release shortly.

zhuoran-guo commented 1 year ago

@quic-bharathr Thank you for the update. We appreciate your efforts in ensuring compatibility with PyTorch 1.9, and we look forward to the release of the updated packages for the 1.28 OS release. Please keep us informed of any further developments or if you need any assistance from our end. It seems that I can export the model after the update in 1.28.