[Training] Define a custom training with some ONNX models

IzanCatalan commented 1 year ago

Describe the issue

Hi everyone,

I would like to know if it is possible to define a custom loss function together with custom forward and backward propagation methods to modify the weights to force some of them to be determinate values.

Thanks.

To reproduce

-

Urgency

No response

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu 1.12.0

PyTorch Version

None

Execution Provider

CUDA

Execution Provider Library Version

Cuda 10.2 and 11.2

thiagocrepaldi commented 1 year ago

Not sure if I followed, but it should be possible to have a custom loss function within the forward method. The backward propagation is automatically built from the forward

baijumeswani commented 1 year ago

IzanCatalan could you please elaborate on what you mean by

custom forward and backward propagation methods to modify the weights to force some of them to be determinate values?

A couple of questions for you:

Are you trying to train using ONNX Runtime?
Do you already have an ONNX model (forward only or forward + loss + backward)?

IzanCatalan commented 1 year ago

Sorry for my poor description @baijumeswani @thiagocrepaldi , I will try to explain myself.

Currently, I have some pre-trained onnx models, some downloaded from OnnxModelZoo GitHub Repo (Resnet50,Vgg16...). These models are trained, and I can obtain the same accuracy specified on the git Repo (around 71% Top1 accuracy). I am also using OnnxRuntime to run these models and get de inference accuracy with ImageNet Dataset.

My question now is if it could be possible to re-train these models (not from the beginning) with some custom backward propagation and custom loss function. I want to do so to promote some values inside some weights during the training instead of normal training.

I was searching, and I could find "On device training" option in OnnxRuntime Python API. Inside "Advanced Usage" there are also forward and training blocks. However, I am wondering if those functions are what I want. I need help to understand them.

Any help or code example in this matter would be very helpful.

Thanks for your replies @thiagocrepaldi @baijumeswani , I hope with this explanation, I can clarify your doubts.

baijumeswani commented 1 year ago

Here is an example that uses the MobileNetV2 model for training. It uses the default SoftmaxCrossEntropyLoss function and only trains the last classifier layer in the model:

https://github.com/microsoft/onnxruntime-training-examples/tree/master/on_device_training/mobile/android/c-cpp

You can refer to the files for more details.

If your use case is more advanced than this, would you please describe what kind of loss function you want to use and what you mean by custom backward propagation?

IzanCatalan commented 1 year ago

Hi @baijumeswani, I have been testing and looking at the examples of your links. I have some doubts I would like to ask you:

1) Is there any way to run the training on GPU? I tried to add "gpu" when creating the module, but it reports an error about an unsupported device gpu. Do I need to install some tools or import something? Because when building Onnxruntime everything was okay, without cuda or GPU errors:

model = orttraining.Module(
    "training_artifacts/training_model.onnx",
    checkpoint_state,
    "training_artifacts/eval_model.onnx",
    "gpu",
)

with error:

  warnings.warn(f'Both `mxnet=={mx.__version__}` and `torch=={torch.__version__}` are installed. '
Traceback (most recent call last):
  File "train.py", line 47, in <module>
    model = orttraining.Module(
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/api/module.py", line 53, in __init__
    get_ort_device_type(self._device_type, device_id),
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 30, in get_ort_device_type
    raise Exception("Unsupported device type: " + device_type)
Exception: Unsupported device type: gpu

2) Related to how to train the new model. Inside your training loop, there are three steps: loss calculation, optimizer step, and reset gradients. Inside this loop, could I see or access the weights that are being modified?

I try to explain myself. After every phase of updating gradients, losses, and optimizer (the main train loop of train.py file that you linked), I would like to access or modify the weights in order to see if their value is enough according to a target value calculated by me. In that case, the training would finish.

Thanks!

baijumeswani commented 1 year ago

Use "cuda" instead of "gpu".
Currently, we don't have a good mechanism to query the weights. The weights are a part of the CheckpointState and the plan is to add utility methods to access the weights and their gradient given an instance of the CheckpointState. Behind the scenes (in ort C++), the checkpoint state contains an unordered map of parameter name to the parameter value (which also contains the gradient information). However, as of now, this information is not exposed through python (or any other language binding for that matter). But we intend to add this functionality in the near future. I can try to see if I can get to it later this month/early next month.

IzanCatalan commented 1 year ago

Thanks for your help, @baijumeswani . Please let me know when you have any updates about this. Referencing your previous answer, at question 1, I have changed "gpu" to "cuda", and now reports me a warning:

2023-07-19 11:40:43.079052029 [W:onnxruntime:, session_state.cc:1162 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf. 2023-07-19 11:40:43.079101490 [W:onnxruntime:, session_state.cc:1164 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.

This is a standard warning with no performance issues, or must I add a flag or something else when creating training artifacts to prevent this warning?

About question 2, I have been searching inside Python Api, and I wonder several things:

1) Could I archive accessing the weight while training with Large Model Training? Perhaps this kind of training performs functions to access them like ORTModule.parameters()

2) In On Device Training API, there are several functions for which I'm not sure of their purpose, and perhaps you could help me to understand them:

Any of those functions could help to achieve my purpose of accessing weights?

baijumeswani commented 1 year ago

2023-07-19 11:40:43.079052029 [W:onnxruntime:, session_state.cc:1162 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf. 2023-07-19 11:40:43.079101490 [W:onnxruntime:, session_state.cc:1164 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.

This is a standard warning with no performance issues, or must I add a flag or something else when creating training artifacts to prevent this warning?

The ONNX model has several nodes. Each node (or a collection of nodes in some cases) maps to a kernel implementation. The kernel implementation can be written for cpu, gpu or both. When the kernel implementation is not available for a gpu, there is an implicit understanding that that op must be executed in cpu. Note that there could be other reasons that the execution is done on cpu as opposed to gpu as well. This is what the warning message is saying. I think in most cases, this warning message can be ignored.

Could I archive accessing the weight while training with Large Model Training? Perhaps this kind of training performs functions to access them like ORTModule.parameters()

The entry point to ORTModule and On-Device Training API is different, and they cannot be used together. As of now, there is no way to access the model parameters/gradients with On-Device Training APIs. I will add this functionality soon.

In On Device Training API, there are several functions for which I'm not sure of their purpose, and perhaps you could help me to understand them:

https://onnxruntime.ai/docs/api/python/on_device_training/training_api.html#onnxruntime.training.api.CheckpointState.__setitem__

This is used for adding custom properties. For example, if you want to add the epoch number to the checkpoint, or the best loss or some information that the user may want to save to the checkpoint for later retrieval.

This is also probably how I would envision being able to access the model parameters in the future. But as of now, the model parameters are not accessible through this.

https://onnxruntime.ai/docs/api/python/on_device_training/training_api.html#onnxruntime.training.api.Module.copy_buffer_to_parameters

This function was designed for the purpose of federated learning where the user might want to access all the model parameters (as a single buffer) to serialize and send to and from a central server.

You could use this function to access the model parameters. However, there is no demarcation information available in them. All the model parameters are returned as a single contiguous array and you want be able to tell which element belongs to which parameter. Having said that, if you know the size of each model parameter, you could decipher the demarcations on your end. Typically the order of the parameters is the same as the order of parameters in which the checkpoint was saved.

IzanCatalan commented 1 year ago

Hello again @baijumeswani , I will look into everything you told me. In the meantime, I'm focused on training models following the python scripts in onnx-trainning-examples repo. Now, I am getting the following error when I try to retrain a resnet50 onnx model:

[ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Invalid rank for input: target Got: 1 Expected: 2 Please fix either the inputs or the model.

This error appears when running the code "loss += model(batch, labels)" according to the example in Training phase

I executed the jupyter notebooks, both Offline tooling and Training phase. When running this example using Mobilenet pre-trained model from torchvision works fine, but when I use my resnet50 model fails, and I figured out the reason why.

When I open in Netron the eval_model.onnx and the training_model.onnx created by the jupyter notebook, both have the next inputs:

name: input type: float32[batch,3,224,224] name: labels type: int64[batch]

With this format, when calling to "loss += model(batch, labels)", a label shape of 1 dimension is expected, so everything is okay.

However, in my case, with two different resnet50 models from ONNX Model Zoo Repo (https://github.com/onnx/models/blob/main/vision/classification/resnet/model/resnet50-v1-12.onnx) and torchvision (https://pytorch.org/vision/0.14/models/generated/torchvision.models.resnet50.html#torchvision.models.resnet50), the inputs I obtained when creating the eval_model.onnx and the training_model.onnx (with the same code used in Offline tooling) have the following shapes:

name: data type: float32[N,3,224,224] name: target type: float32[N,1000]

The name 'data' is the name of the input in the resnet50 models, and the name 'input' is the name of the input in the mobilenetv2 model. However, I don't quite understand how the name for the labels changes from "labels" to "target" and what is more important, the change of shapes from [batch] to [N,1000].

'N' and 'batch' are different names for the batch size of the model. However, the number of dimensions is essential. It seems that when executing "loss += model(batch, labels)", it is expected the shape[N,1000] however, the labels don't have that shape because they only have the correct class prediction. The number 1000 is the total number of classes of Imagenet. I show you some pictures below.

I would like to know how I can change the configuration of the training artifacts just to get in the eval_model.onnx and training model.onnx an expected shape of the labels that match de batch size, just like it is done with the jupiter notebook and with mobilentv2 model in Offline tooling.

Thanks for your help! Mobilenet Model: imagen Mobilenet Training Model: imagen

Resnet50 Model: imagen Resnet50 Training Model: imagen

baijumeswani commented 1 year ago

Did you by any chance change the loss type? If you selected the loss type as CrossEntropyLoss, you should see the labels as input and not target. Perhaps you're using MSELoss or BCEWithLogitsLoss in your script?

IzanCatalan commented 1 year ago

Yes, @baijumeswani I was usign MSELoss, now I changed to CrossEntropyLoss and the shapes are okey, How is that possible just by changing the loss type? Does it mean that I cannot use MSELoss ?

baijumeswani commented 1 year ago

the definition of MSELoss is

reduce((target - y)^2).

So, target must be the same rank as y (the output of the model). MSELoss can typically be used when dealing with a regression task, while cross entropy loss is extensively used for the classification task.

IzanCatalan commented 1 year ago

Ah, okay, @baijumeswani, I got it. Thanks again. However, I still need help to make the code work. I'm having trouble with an internal function of onnxruntime. When running the training, it reports a fail regarding a 0-dimension array. However, I use a code similar to Mnist training. I will show you some pictures of the code below.

The error happens when running the line "trainloss, = model(*forward_inputs)", having inside forward inputs the batch images and the label targets inside a list. I used exactly the same code. However, my dataset is loaded with a torch-vision imagenet loader module. I debug the code with the Python version of GDB called PDB, and then going step by step, I checked that the last line of the code before the failure is the one in /usr/local/lib/python3.8/dist-packages/onnxruntime/training/api/module.py:93 -> return fetches[0].numpy().

This is the trace of the error:

Traceback (most recent call last): File "train.py", line 110, in train(epoch) File "train.py", line 39, in train trainloss, = model(*forward_inputs) TypeError: iteration over a 0-d array

/usr/local/lib/python3.8/dist-packages/onnxruntime/training/api/module.py:93

def __call__(self, *user_inputs) -> tuple[np.ndarray] | np.ndarray:
    """Invokes either the training or the evaluation step of the model.

    Args:
        *user_inputs: The inputs to the model.
    Returns:
        The outputs of the model.
    """
    is_np_input = False
    forward_inputs = OrtValueVector()
    forward_inputs.reserve(len(user_inputs))
    for tensor in user_inputs:
        if isinstance(tensor, np.ndarray):
            is_np_input = True
            forward_inputs.push_back(OrtValue.ortvalue_from_numpy(tensor)._ortvalue)
        elif isinstance(tensor, OrtValue):
            forward_inputs.push_back(tensor._ortvalue)
        else:
            raise ValueError(f"Expected input of type: numpy array or OrtValue, actual: {type(tensor)}")
    fetches = OrtValueVector()

    if self.training:
        self._model.train_step(forward_inputs, fetches)
    else:
        self._model.eval_step(forward_inputs, fetches)

    if len(fetches) == 1:
        if is_np_input:
            return fetches[0].numpy() --------->FAILS!!!

        return fetches[0]

    return tuple(val.numpy() for val in fetches) if is_np_input else tuple(fetches)

I also have debugged what is inside fetches[0], and I compare the output with Training phase. Training notebook always works fine and when debugging throws the next output: imagen I printed the shapes of the batch images and the labels for one epoch (it prints 20 times in total because 20 is the number of images per class with a batch size of 4 images (one for each class and 20 batches in total), and for each batch, a training step is done with "loss+=model(data,label)"). Below them is the length of 'fetches' and 'fetches[0]'.

I don't know what it is, but it seems never the same number. However, doing the same with my code only printed one number and failed when running return fetches[0].numpy: imagen Seems a little strange because it prints the number inside fetches[0] but later fails in the return. I don't know how to proceed because I believe I do training the correct way, with the proper shape in batch images and labels: (4, 3, 224, 224) and (4,), precisely the same as the Training phase and Mnist training notebook.

I need some help again to find the source of the problem. I let you with some pictures of my code. Thanks again! imagen imagen

IzanCatalan commented 1 year ago

Hi again @baijumeswani , Is there any update about your plan of adding utility methods to access the weights and their gradient given an instance of the CheckpointState? In addition, the previous error I commented on July 27, do you have any thoughts about how to solve it? Thank you.

baijumeswani commented 1 year ago

Hi again @baijumeswani , Is there any update about your plan of adding utility methods to access the weights and their gradient given an instance of the CheckpointState?

Maybe coincidental :). Please see https://github.com/microsoft/onnxruntime/pull/17364. You should be able to access the weights and their gradients after this pull request.

In addition, the previous error I commented on July 27, do you have any thoughts about how to solve it? Thank you.

Will take a deeper look at it this week. Sorry for the delay.

baijumeswani commented 1 year ago

IzanCatalan are you able to share your training model, checkpoint file? Or show me how to reproduce your error?

IzanCatalan commented 1 year ago

Hi @baijumeswani , sorry for the delay. Yes, I can share everything you need to check the error. I leave you here a link to a git Repo where you can find the following files: Link: https://github.com/IzanCatalan/docker/tree/master

prepare_for_training.py -> set the artifacts that will be later used for training. To reproduce: python3.8 prepare_for_training.py resnet50-v1-12.onnx /directoryToTrain
train.py -> re-train the model using the artifacts and the already generated eval and training models together with checkpoint file (it mast be in the same directory that they are). To reproduce: python3.8 train.py
resnet50-v1-12.onnx -> original onnx model.
training_model, eval_model and checkpoint are the artifacts created by prepare_for_training.py.

In addition, you must have downloaded Imagenet dataset 2012 to load all the training dataset used in train,py.

IzanCatalan commented 1 year ago

And another doubt @baijumeswani , regarding the https://github.com/microsoft/onnxruntime/pull/17364 push. Apparently, it is still under review. You can specify which functions I can use to access de weight? Are those functions already in the Python API? I could find this: https://onnxruntime.ai/docs/api/python/on_device_training/training_api.html#onnxruntime.training.api.Module.get_contiguous_parameters

Do you have any examples of using these new functions?

Thanks for all the help!

IzanCatalan commented 1 year ago

Hi @baijumeswani , have you any update about my doubts??

baijumeswani commented 1 year ago

@IzanCatalan, I think I know the reason for your error:

train_loss, _ = model(*forward_inputs)

You're asking the output of the training onnx model be returned as a tuple. In reality, the training_model.onnx only has ` registered user output which is the loss. So, you must change this call to:

train_loss = model(*forward_inputs)

You must do the same thing for the test function as well. Change:

test_loss, logits = model(*forward_inputs)

to

test_loss = model(*forward_inputs)

Now, if you do want the logits output in the eval model, you must register those outputs in the onnx model before loading them to onnxruntime:

# Code from prepare_for_training.py
artifacts.generate_artifacts(
   onnx_model,
   requires_grad=requires_grad,
   frozen_params=frozen_params,
   loss=artifacts.LossType.CrossEntropyLoss,
   optimizer=artifacts.OptimType.AdamW,
   artifact_directory=sys.argv[2]
)

eval_model = onnx.load(f"{sys.argv[2]}/eval_model.onnx")
eval_model.graph.output.append(onnx_model.graph.output[0])

Note that the training model still only has 1 user output registered. So, you must query it using training_loss = model(...). The eval model has two user outputs registered, so you can query them using eval_loss, logits = model(...)

Give this a try. Hopefully this will resolve your problem.

IzanCatalan commented 1 year ago

Hi @baijumeswani, I have corrected the errors in the train and test functions. You were right, now it works, and no error happens. However, I have an error when I export the model after the test function:

# ort training api - export the model for so that it can be used for inferencing
model.export_model_for_inferencing("inference.onnx", ["output"])

This is the output I get:

Traceback (most recent call last):
  File "train.py", line 134, in <module>
    model.export_model_for_inferencing("inference.onnx", ["output"])
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/training/api/module.py", line 184, in export_model_for_inferencing
    self._model.export_model_for_inferencing(os.fspath(inference_model_uri), graph_output_names)
RuntimeError: /home/onnxruntime/orttraining/orttraining/training_api/module.cc:492 onnxruntime::common::Status 
onnxruntime::training::api::Module::ExportModelForInferencing(const string&, gsl::span<const std::__cxx11::basic_string<char> >) const [ONNXRuntimeError] : 1 : FAIL : module.cc:79 
TransformModelOutputsForInference Expected graph output for inference graph output could not be found. Please regenerate the eval graph.

Apparently, there is a problem with the eval graph, so I think something needs to be fixed with how I append the outputs to the eval model. According to what you recommended, I modified how I generate the artifacts to:

# Generate the training artifacts.
artifacts.generate_artifacts(
   onnx_model,
   requires_grad=requires_grad,
   frozen_params=frozen_params,
   loss=artifacts.LossType.CrossEntropyLoss,
   optimizer=artifacts.OptimType.AdamW,
   artifact_directory=sys.argv[2]
)

eval_model = onnx.load(f"{sys.argv[2]}/eval_model.onnx")
eval_model.graph.output.append(onnx_model.graph.output[0])
onnx.save(eval_model, "eval_model2.onnx")

Is there something wrong with it?

baijumeswani commented 1 year ago

Hi IzanCatalan. From looking at your eval model, the output that you should be requesting is "resnetv17_dense0_fwd".

That is the name of output of the original inference model (which is also fed into the loss in the training and eval models).

model.export_model_for_inferencing("inference.onnx", ["resnetv17_dense0_fwd"])

should work.

IzanCatalan commented 1 year ago

Thanks, @baijumeswani. It works fine now. However, I have a problem with the function export_model_for_inferencing. The output model has opset number 19. Currently, I have installed onnx 1.14 and onnxruntime build from source 1.17+cu112. With this configuration, the new inference model works.

However, when I try to run inference in other machines where I have installed onnxruntime-gpu from pre-build packages (by doing pip install ..), it fails and reports me the following error:


  File "insertValidate.py", line 172, in evaluate
    ort_session_cpu = onnxruntime.InferenceSession(model_path,providers=['CUDAExecutionProvider'])
  File "/usr/local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 347, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/usr/local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 384, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from inference.onnx failed:/onnxruntime_src/onnxruntime/core/graph/model_load_utils.h:47 void onnxruntime::model_load_utils::ValidateOpsetForDomain(const std::unordered_map<std::basic_string<char>, int>&, const onnxruntime::logging::Logger&, bool, const string&, int) ONNX Runtime only *guarantees* support for models stamped with official released onnx opset versions. Opset 19 is under development, and support for this is limited. The operator schemas and or other functionality may change before next ONNX release and in this case, ONNX Runtime will not guarantee backward compatibility. Current official support for domain com.ms.internal.nhwc is till opset 17.

To be more accurate, the new inference onnx model fails when tryining to run inference with onnxruntime-gpu 1.12 and 1.4. I have these versions because of the compatibility with Cuda (11.2 and 10.2, respectively -> Compatibility ).

What do I need to do to transform the onnx model from opset 19 to a minor number? Currently, I run inference with opset 12 models, so it would be perfect to downgrade opset to 12. I tried to find a way, but export_model_for_inferencing has no parameter to configure the opset. I tried to convert the model by using onnx version converter, but it does not work (the same error happens):

converted_model = version_converter.convert_version(model, 12)
onnx.save(converted_model, new_model)

IzanCatalan commented 1 year ago

@baijumeswani Hello again; regarding my last post, I tried to update my onnx package version (only in one machine with prebuild onnxruntime package) from 1.12 to 1.14.1 without updating onnxruntime-gpu (for cuda compatibility, I cannot do it). I believed that since onnx version 1.14 is already compatible with opset up to 19, perhaps updating only onnx package would have solved the problem. But I was wrong. It doesn't work.

I also have checked the opset version of the new inference.onnx model created with:

opset_version = onnx_model.opset_import[0].version if len(onnx_model.opset_import) > 0 else None

And the output is opset 14. So I don't know why the error still happens and it tells the same:

ONNX Runtime only *guarantees* support for models stamped with official released onnx opset versions. Opset 19 is under development and support for this is limited. The operator schemas and or other functionality may change before next ONNX release and in this case ONNX Runtime will not guarantee backward compatibility. Current official support for domain com.ms.internal.nhwc is till opset 17.

How can I convert the model, or at least create it with a lower opset?

baijumeswani commented 1 year ago

Would you please share the final inference model that you're trying to load into the InferenceSession?

IzanCatalan commented 1 year ago

@baijumeswani Yes, you can find two models in inferenceFolder:

newInferenceMobilenet.onnx is an onnx model created from the train notebook Training phase .
newInferenceResnet.onnx is an onnx model creted by myself following your comments from the previous posts.

Both of them fail when loading them into the Inference Session with prebuild onnxruntime-gpu 1.12 and 1.4 with onnx 1.12/1.14.

IzanCatalan commented 1 year ago

Hi again @baijumeswani , is there any update about my last post?

Sorry, I have another question about generating artifacts for an ONNX model. I am creating artifacts from a ResNet50 model (which can be found in my Git Repo).

However, I am creating multiple artifacts with different configurations of required and frozen parameters. Specifically, I am facing an issue with this ResNet50 model when generating artifacts by selecting all parameters from the model to be required:

frozen_params = []
requires_grad = [
   param.name
   for param in onnx_model.graph.initializer
   if param.name not in frozen_params

Then, I get the following error:

RuntimeError: /home/onnxruntime/orttraining/orttraining/python/orttraining_pybind_state.cc:897 onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::python::PyGradientGraphBuilderContext*)> [ONNXRuntimeError] : 1 : FAIL : graph_augmenter.cc:23 AddToExistingNodeArgs add graph outputs - failed to find NodeArg by name: resnetv17_stage3_batchnorm1_running_var_grad

The parameter "resnetv17_stage3_batchnorm1_running_var_grad" cannot be found in the initialisers, nodes or inputs of the ResNet50 model. It is possible that a parameter from the optimizer-training artifact model will be generated. As a solution, I also attempt using a parameter in the frozen parameters.

frozen_params = ["resnetv17_conv0_weight"]
requires_grad = [
   param.name
   for param in onnx_model.graph.initializer
   if param.name not in frozen_params
]

However, the same error arises, but this time with the parameter "resnetv17_stage2_batchnorm4_running_var_grad". It seems that there may be an issue with the batch normalization layers, as selecting only the required convolution layers as parameters with no frozen parameters works correctly.

Is this a bug or it is a problem of my required/frozen params configuration?

Let me know if you find out something about it or about my last post.

Thanks.

baijumeswani commented 1 year ago

@baijumeswani Yes, you can find two models in [inferenceFolder]> (https://github.com/IzanCatalan/docker/tree/master/inferenceFolder):

newInferenceMobilenet.onnx is an onnx model created from the train notebook Training phase . newInferenceResnet.onnx is an onnx model creted by myself following your comments from the previous posts. Both of them fail when loading them into the Inference Session with prebuild onnxruntime-gpu 1.12 and 1.4 with onnx 1.12/1.14.

Would it be feasible to perform inference with the same package that performs training? The onnxruntime-training package is a superset (for all practical purposes) of the onnxruntime-gpu package. So, you should just be able to depend on the training package (even for inferencing).

If for some reason, it is not possible to use the training package for inferencing, you could consider using the nightly version of the onnxruntime-gpu package. This is needed because (as far as I understand), you're using the nightly package for onnxruntime-training. And that adds an opset that is not a part of the released onnxruntime-gpu

baijumeswani commented 1 year ago

Then, I get the following error:

RuntimeError: /home/onnxruntime/orttraining/orttraining/python/orttraining_pybind_state.cc:897 onnxruntime::python::addObjectMethodsForTraining(pybind11::module&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::python::PyGradientGraphBuilderContext*)> [ONNXRuntimeError] : 1 : FAIL : graph_augmenter.cc:23 AddToExistingNodeArgs add graph outputs - failed to find NodeArg by name: resnetv17_stage3_batchnorm1_running_var_grad

Not all the model initializers in the model can be trained. For example, the batchnorm layer (in training) has some initializers that cannot have their gradient computed. These initializers are the mean and variance inputs of the batch norm layer.

If you intend to train the batchnorm layer, you will probably need to put the running mean and the running variance in the frozen parameters:

frozen_params = []
requires_grad = []
for init in model.graph.initializer:
    if init.name.endswith("running_mean") or init.name.endswith("running_var"):
        frozen_params.append(init.name)
    else:
        requires_grad.append(init.name)

Please note that there could be other such initializers that are not trainable.

IzanCatalan commented 1 year ago

@baijumeswani Yes, you can find two models in [inferenceFolder]> (https://github.com/IzanCatalan/docker/tree/master/inferenceFolder): newInferenceMobilenet.onnx is an onnx model created from the train notebook Training phase . newInferenceResnet.onnx is an onnx model creted by myself following your comments from the previous posts. Both of them fail when loading them into the Inference Session with prebuild onnxruntime-gpu 1.12 and 1.4 with onnx 1.12/1.14.

Would it be feasible to perform inference with the same package that performs training? The onnxruntime-training package is a superset (for all practical purposes) of the onnxruntime-gpu package. So, you should just be able to depend on the training package (even for inferencing).

If for some reason, it is not possible to use the training package for inferencing, you could consider using the nightly version of the onnxruntime-gpu package. This is needed because (as far as I understand), you're using the nightly package for onnxruntime-training. And that adds an opset that is not a part of the released onnxruntime-gpu

@baijumeswani When you say nightly version, it means using onnxruntime from source? I don't understand it completely.

I tried to compile onnxruntime from source on my other machine. In this case, the cuda version is 10.2 and the gcc/g++ version is 7.5. In this machine, I cannot change that configuration. When building, I got an error saying the GCC version should be 8 or higher. To do so, I used a conda environment to install gcc/g++ 8.5 and to build I used this command inside this environment:

./build.sh --config=RelWithDebInfo --enable_training --build_wheel --use_cuda --skip_test --cuda_home /usr/local/cuda-10.2/ --cudnn_home /usr/local/cuda-10.2/ --cuda_version=10.2

However, I got another error:

In file included from /mnt/beegfs/users/izan/onnxruntime/onnxruntime/core/common/threadpool.cc:22:
/mnt/beegfs/users/izan/onnxruntime/include/onnxruntime/core/common/eigen_common_wrapper.h:19:32: error: unknown option after '#pragma GCC diagnostic' kind [-Werror=pragmas]
 #pragma GCC diagnostic ignored "-Wdeprecated-copy"
                                ^~~~~~~~~~~~~~~~~~~
cc1plus: error: unrecognized command line option '-Wno-deprecated-copy' [-Werror]
gmake[2]: *** [CMakeFiles/onnxruntime_common.dir/build.make:230: CMakeFiles/onnxruntime_common.dir/mnt/beegfs/users/izan/onnxruntime/onnxruntime/core/common/threadpool.cc.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:1891: CMakeFiles/onnxruntime_common.dir/all] Error 2
gmake: *** [Makefile:166: all] Error 2

I guess this error happens because the compiler does not recognize a pragma. Does it mean that I have to use another gcc version? I tried GCC 10.5, and it was incompatible. Apparently, gcc 10 or higher is not an option, so that I will try version 9.

Perhaps it is because cuda 10.2 is too old and is has not compatibility with onnxruntime 1.17? Is this what you meant when saying that I should install the nightly version of onnxruntime?

baijumeswani commented 1 year ago

Sorry, I missed your comment. By nightly package, I mean the onnxruntime-gpu nightly python package that you can get by doing:

pip install coloredlogs flatbuffers numpy packaging protobuf sympy
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ ort-nightly-gpu

baijumeswani commented 1 year ago

Another thing you can do is remove the internal opset version com.ms.internal.nhwc from the model by doing:

import onnx

model = onnx.load("inference_model")
opsets = [opset for opset in model.opset_import if opset.domain != "com.ms.internal.nhwc"]
del model.opset_import[:]
model.opset_import.extend(opsets)

# ...
# Perform inference using the InferenceSession

IzanCatalan commented 1 year ago

Hi @baijumeswani , I tried the last one, and it worked! Thank you. However, I have another issue. I wonder if you could help me. I am trying to install the last version of onnxruntime from source in a conda environment to use it in other machines and to avoid eliminating opsets. I successfully installed the python wheel created in my docker container into my conda environment due to the version of cuda and cudnn is the same. However, I get this error:

Traceback (most recent call last):
  File "docker/train.py", line 2, in <module>
    from onnxruntime.training import artifacts
  File "/mnt/beegfs/gap/izcagal/.conda/envs/onnx/lib/python3.8/site-packages/onnxruntime/__init__.py", line 56, in <module>
    raise import_capi_exception
  File "/mnt/beegfs/gap/izcagal/.conda/envs/onnx/lib/python3.8/site-packages/onnxruntime/__init__.py", line 23, in <module>
    from onnxruntime.capi._pybind_state import ExecutionMode  # noqa: F401
  File "/mnt/beegfs/gap/izcagal/.conda/envs/onnx/lib/python3.8/site-packages/onnxruntime/capi/_pybind_state.py", line 32, in <module>
    from .onnxruntime_pybind11_state import *  # noqa
ImportError: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by /mnt/beegfs/gap/izcagal/.conda/envs/onnx/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.so)

I have glibc 2.27, but I wonder if this error is directly because my version is old or it has a relationship with other dependencies because when I used old onnxruntime versions like 1.4 or 1.12, this error didn't appear.

florm99 commented 7 months ago

I had the very same issue as @IzanCatalan mentioned:

Ah, okay, @baijumeswani, I got it. Thanks again. However, I still need help to make the code work. I'm having trouble with an internal function of onnxruntime. When running the training, it reports a fail regarding a 0-dimension array. However, I use a code similar to Mnist training. I will show you some pictures of the code below.

The error happens when running the line "trainloss, = model(*forward_inputs)", having inside forward inputs the batch images and the label targets inside a list. I used exactly the same code. However, my dataset is loaded with a torch-vision imagenet loader module. I debug the code with the Python version of GDB called PDB, and then going step by step, I checked that the last line of the code before the failure is the one in /usr/local/lib/python3.8/dist-packages/onnxruntime/training/api/module.py:93 -> return fetches[0].numpy().

This solution by @baijumeswani helped me too:

@IzanCatalan, I think I know the reason for your error:
train_loss, _ = model(*forward_inputs)
You're asking the output of the training onnx model be returned as a tuple. In reality, the training_model.onnx only has ` registered user output which is the loss. So, you must change this call to:
train_loss = model(*forward_inputs)

The reason I ran into this issue is that I copied the example for desktop on-device training. I suggest to fix this bug in the MNIST example.

microsoft / onnxruntime