Open IzanCatalan opened 1 year ago
@IzanCatalan The answer to your larger question is yes, one can use onnxruntime to perform training. :)
We support two forms of training:

1. ORTModule is a PyTorch frontend and therefore looks and feels exactly like any other torch.nn.Module; the only difference is that execution of the model is handled by onnxruntime's engine. The entry point to this form of training is a PyTorch model, and most libraries compatible with a torch.nn.Module will work with ORTModule. More details can be found here.
2. On-Device Training takes an ONNX model plus pre-generated training artifacts and trains it directly with onnxruntime, without a PyTorch dependency at training time.

Hope this helps.
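To illustrate the drop-in nature of ORTModule, here is a minimal sketch. The wrap is guarded with a try/except so the snippet also runs as plain PyTorch where the onnxruntime-training package is not installed; the tiny model is purely illustrative.

```python
import torch

# Minimal sketch of the ORTModule drop-in pattern: define an ordinary
# torch.nn.Module, then wrap it so execution is handled by onnxruntime.
class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

model = TinyNet()
try:
    from onnxruntime.training import ORTModule  # requires onnxruntime-training
    model = ORTModule(model)  # execution now handled by onnxruntime's engine
except ImportError:
    pass  # fall back to plain PyTorch execution

out = model(torch.randn(4, 10))
print(out.shape)  # torch.Size([4, 2])
```

Training loops, optimizers, and loss functions stay the same as in plain PyTorch; only the module wrap changes.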
@baijumeswani Thanks for your quick answer. Of your two options, the one that interests me most is the first. I have my own device with a GPU running CUDA 10.2/11.2, so it is good that the API supports it. Furthermore, I am not looking to do large model training or to train a model from scratch. I have already downloaded pre-trained models like ResNet50 and VGG16 from the official ONNX Model Zoo GitHub repo. I want to take these models, modify their weights, and re-train them on the ImageNet dataset. From what you describe, I assume I will be able to do this with the Python API.
Do you have any similar examples other than those shown at the API?
Thanks.
Here is an example for MobileNetV2:
Please note that I am currently working on bringing these examples up to date with our API (they are a little outdated and may not work out of the box).
@IzanCatalan does the shared example work for you, anything else we can help here?
Sorry @zhijxu-MS, I was busy this past month, and I couldn't take a look. I'm going to focus now on implementing what @baijumeswani suggested to me.
However, there is some trouble with the links: some of them are down, and those pointing into the GitHub repo no longer exist. I did find this link inside the onnxruntime-training-examples repo, which may be similar: https://github.com/microsoft/onnxruntime-training-examples/tree/master/on_device_training/mobile/android/c-cpp
I am interested in "artifacts". It seems that you can load an ONNX model and, through these artifacts, select which layer parameters will not be modified during training. Is that correct?
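For reference, artifact generation looks roughly like this. This is a hedged sketch assuming onnxruntime-training >= 1.15 and the onnx package are installed; the tiny hand-built graph stands in for a real downloaded ResNet50, and the names ("W", "artifacts_out") are illustrative, not from the thread.

```python
# Sketch of onnxruntime.training.artifacts.generate_artifacts: it takes an
# ONNX model and produces the training/eval/optimizer graphs and checkpoint
# needed by the on-device Training API. Guarded so it is a no-op where the
# packages are not installed.
import numpy as np

try:
    from onnx import TensorProto, helper
    from onnxruntime.training import artifacts

    # Tiny one-MatMul model standing in for a real pre-trained network.
    W = np.random.rand(4, 2).astype(np.float32)
    w_init = helper.make_tensor("W", TensorProto.FLOAT, [4, 2], W.flatten())
    node = helper.make_node("MatMul", ["X", "W"], ["Y"])
    graph = helper.make_graph(
        [node], "tiny",
        [helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 4])],
        [helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 2])],
        [w_init],
    )
    model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 14)])

    artifacts.generate_artifacts(
        model,
        requires_grad=["W"],   # initializers that should receive gradients
        frozen_params=[],      # whole initializers to keep fixed
        loss=artifacts.LossType.MSELoss,
        optimizer=artifacts.OptimType.AdamW,
        artifact_directory="artifacts_out",
    )
    generated = True
except ImportError:
    generated = False  # onnx / onnxruntime-training not installed; sketch only

print(generated)
```

Note that requires_grad and frozen_params both operate on whole initializers (named parameters), which is relevant to the element-level question below.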
Moreover, since my intention from the beginning was to apply a kind of transfer learning to several ONNX models, my idea is to first modify some weights and save the model to a new ONNX file; later I will re-train that model.
By transfer learning I mean changing some individual elements of certain weight tensors, training the model again, and observing how the remaining weights are updated to preserve accuracy. The elements I modified, however, must remain intact throughout training. I plan to do this with on-device training from the ONNX Runtime API, since the modified ONNX models are already pre-trained. Am I correct in assuming I could retrain the new model (with, for example, the full ImageNet training set) this way?
I thought I could do this with "artifacts" via the frozen_params option. However, that only applies to a whole parameter, not to individual elements inside a parameter. Can I "freeze" some elements inside a parameter?
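Since frozen_params works at whole-parameter granularity, element-level freezing is not a built-in feature; the usual workaround is gradient masking, i.e. zeroing the gradient at the frozen positions before the optimizer step. A minimal NumPy sketch of the idea (the mask position and learning rate are arbitrary):

```python
import numpy as np

# Sketch of element-level freezing via gradient masking (not a built-in
# onnxruntime feature): zero the gradient at positions that must stay
# fixed, then apply an ordinary SGD-style update.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))
frozen_mask = np.zeros_like(W)
frozen_mask[0, 0] = 1.0               # this single element must not change
W_before = W.copy()

grad = rng.standard_normal((3, 3))    # stand-in for a real backward pass
lr = 0.1
W -= lr * grad * (1.0 - frozen_mask)  # masked update

assert W[0, 0] == W_before[0, 0]      # frozen element unchanged
assert (W != W_before).any()          # the rest of the tensor did update
```

With ORTModule (the PyTorch frontend), the same effect can be achieved with a backward hook on the parameter that multiplies its gradient by such a mask.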
I would appreciate any help from you @baijumeswani and @zhijxu-MS .
Thank you.
Describe the issue
Hi everyone!
I would like to know if it is possible to train neural network models using ONNX Runtime and export them in ONNX format (using only ONNX Runtime).
I have only seen Jupyter notebooks for a few specific CNN models like MobileNet, ResNet, and VGG in the ONNX Model Zoo GitHub repo (https://github.com/onnx/models/blob/main/vision/classification/vgg/train_vgg.ipynb), and saving the trained models in ONNX format is not shown in those notebooks.
Moreover, I wonder if it is also possible to re-train pre-trained, state-of-the-art models in ONNX format in order to do transfer learning. I would like to re-train the models, change some of their weights, and later see the results in terms of accuracy.
Any help would be appreciated. Thanks!
Izan.
To reproduce
No reproduction steps are needed; this is a question.
Urgency
No response
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
onnxruntime-gpu 1.12.0
PyTorch Version
None
Execution Provider
CUDA
Execution Provider Library Version
CUDA 10.2 and 11.2