microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Training] IR version incompatibility in artifact generation for on-device training #20726

Open tomaz-suller opened 1 month ago

tomaz-suller commented 1 month ago

Describe the issue

Trying to execute the example notebook provided in on_device_training/desktop/python/mnist.ipynb results in an error about IR version incompatibility, stating the optimiser only supports version <=9 while the generated artifacts use version 10.

To reproduce

  1. Install on-device training dependencies for offline stage as instructed here
  2. Install additional dependencies to execute the notebook
    ipykernel
    ipywidgets
    torch
    torchvision
    matplotlib
    netron
    evaluate

    (I initially added them to requirements.txt, then installed them one by one after each ImportError to rule that out as the cause.)

  3. Execute notebook until the first cell of section "3 - Initialize Module and Optimizer"; no errors should be raised
  4. Execute first cell of the section

    # Create checkpoint state.
    state = CheckpointState.load_checkpoint("data/checkpoint")
    
    # Create module.
    model = Module("data/training_model.onnx", state, "data/eval_model.onnx")
    
    # Create optimizer.
    optimizer = Optimizer("data/optimizer_model.onnx", model)

    which should raise the following error

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[18], line 8
      5 model = Module("data/training_model.onnx", state, "data/eval_model.onnx")
      7 # Create optimizer.
----> 8 optimizer = Optimizer("data/optimizer_model.onnx", model)

File venv/lib/python3.12/site-packages/onnxruntime/training/api/optimizer.py:24, in Optimizer.__init__(self, optimizer_uri, module)
     23 def __init__(self, optimizer_uri: str | os.PathLike, module: Module):
---> 24     self._optimizer = C.Optimizer(
     25         os.fspath(optimizer_uri), module._state._state, module._device, module._session_options
     26     )

RuntimeError: /onnxruntime_src/orttraining/orttraining/training_api/optimizer.cc:273 void onnxruntime::training::api::Optimizer::Initialize(const onnxruntime::training::api::ModelIdentifiers&, const std::vector<std::shared_ptr<onnxruntime::IExecutionProvider> >&, gsl::span<OrtCustomOpDomain* const>) [ONNXRuntimeError] : 1 : FAIL : Load model from data/optimizer_model.onnx failed:/onnxruntime_src/onnxruntime/core/graph/model.cc:179 onnxruntime::Model::Model(onnx::ModelProto&&, const onnxruntime::PathString&, const onnxruntime::IOnnxRuntimeOpSchemaRegistryList*, const onnxruntime::logging::Logger&, const onnxruntime::ModelOptions&) Unsupported model IR version: 10, max supported IR version: 9
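
To confirm which of the generated artifacts carry IR version 10, one could read each file's `ir_version` field before constructing `Module` and `Optimizer`. The following is a minimal sketch and not part of the notebook: the helper names `exceeds_max_ir` and `report_ir_versions` are hypothetical, the limit of 9 is taken from the error message above, and `onnx.load` is assumed to be available from the installed `onnx` package.

```python
from pathlib import Path

# Taken from the error message above: this onnxruntime build accepts IR version <= 9.
MAX_SUPPORTED_IR = 9


def exceeds_max_ir(ir_version: int, max_supported: int = MAX_SUPPORTED_IR) -> bool:
    """Return True when a model's IR version is newer than the runtime accepts."""
    return ir_version > max_supported


def report_ir_versions(paths, max_supported: int = MAX_SUPPORTED_IR) -> dict:
    """Map each existing model path to its IR version and print the ones that are too new."""
    import onnx  # imported lazily so exceeds_max_ir works without onnx installed

    versions = {}
    for path in map(Path, paths):
        if not path.exists():
            continue  # skip artifacts that were not generated
        # Only graph metadata is needed, so external tensor data is not loaded.
        model = onnx.load(str(path), load_external_data=False)
        versions[str(path)] = model.ir_version
        if exceeds_max_ir(model.ir_version, max_supported):
            print(f"{path}: IR version {model.ir_version} > {max_supported}")
    return versions
```

Calling `report_ir_versions` with the three paths from the cell above (`data/training_model.onnx`, `data/eval_model.onnx`, `data/optimizer_model.onnx`) would show whether only the optimizer model or all artifacts are affected.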

Urgency

I need to develop on top of this for a project due next month.

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.17.3

PyTorch Version

2.3.0+cu121

Execution Provider

ROCm

Execution Provider Library Version

ROCm 6.0.2

tomaz-suller commented 1 month ago

I suspect some incompatibility due to versions of system or other Python packages could be to blame, since I'm running EndeavourOS (rolling release, Arch-based) with Python 3.12.3.

I tried downgrading onnx to 1.14.1, but the build failed with an error from absl complaining that my compiler didn't support C++14 (which is odd, since it should), so I gave up on that route.

tomaz-suller commented 1 month ago

I just checked, and I get the same error in Google Colab following the same steps, this time running on CPU with Python 3.10.12.

carzh commented 1 month ago

@tomaz-suller what version of ONNX are you using? If you haven't already, could you try with onnx==1.15.0? Also, what version of onnxruntime-training are you using?

tomaz-suller commented 1 month ago

It does work with onnx==1.15.0 in Colab. I'm using onnxruntime-training-cpu==1.17.3.

Edit: locally, I get the ABSL build error about C++14 I mentioned when trying to downgrade, but then the issue isn't with ONNX anymore.
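
Since pinning onnx==1.15.0 resolved the error in Colab, a notebook could guard against a newer onnx before generating artifacts. This is a rough sketch under that assumption only: the helper names are hypothetical, and the cutoff of 1.15 reflects what was reported to work in this thread rather than an official compatibility statement.

```python
from importlib import metadata

# Assumption drawn from this thread: artifacts generated with onnx==1.15.0
# loaded correctly in onnxruntime-training 1.17.3.
LAST_KNOWN_GOOD = (1, 15)


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '1.15.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_known_good_onnx(version: str, last_good: tuple = LAST_KNOWN_GOOD) -> bool:
    """Return True if the given onnx version is no newer than the known-good release."""
    return parse_major_minor(version) <= last_good


def installed_onnx_version():
    """Return the installed onnx version string, or None if onnx is not installed."""
    try:
        return metadata.version("onnx")
    except metadata.PackageNotFoundError:
        return None
```

For example, a first notebook cell could call `installed_onnx_version()` and raise a clear error when `is_known_good_onnx` returns False, instead of failing later inside `Optimizer`.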

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.