microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Training] Forcing Parameter's Values during On Device Training #19261

Open IzanCatalan opened 9 months ago

IzanCatalan commented 9 months ago

Describe the issue

Hi everyone, I'm trying to force some parameter values (convolution layer weights) during the re-training process using the On-Device Training features from the onnx-runtime-training-examples repo. I am re-training some ONNX models from the ONNX Model Zoo repo.

My goal is to end up with an ONNX model in which, after the re-training phase, some weight values are the ones I determine, while keeping the same accuracy. I update those values during re-training (via On-Device Training with ORT), inserting the same values in every epoch.

To be more specific: inside the training phase, after putting the model in train mode (model.train()) and after the optimizer has performed its step, once every 100 batches I insert the values I want into the parameters I want (always the same values and, as mentioned, the same procedure every epoch) by accessing the CheckpointState parameters.

After that, I continue the standard training loop by resetting the gradients (lazy_reset_grad). A sketch of this loop is shown below.
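Roughly, the loop looks like the sketch below (the artifact paths, the parameter name conv1.weight, the forced index, the epoch count, and the data loader are placeholders, and the parameter accessor state.parameters[...].data may differ between ORT versions):

    from onnxruntime.training import api as orttraining

    # Load the training artifacts (paths are placeholders).
    state = orttraining.CheckpointState.load_checkpoint("training_artifacts/checkpoint")
    model = orttraining.Module("training_artifacts/training_model.onnx", state,
                               "training_artifacts/eval_model.onnx", "cuda")
    optimizer = orttraining.Optimizer("training_artifacts/optimizer_model.onnx", model)

    def force_values(state):
        # Placeholder parameter name and index.
        param = state.parameters["conv1.weight"]
        data = param.data           # getter copies the tensor from the device to a numpy array
        data.flat[0] = 25.0         # overwrite the chosen element(s)
        param.data = data           # setter copies the array back to the device

    num_epochs = 10                 # placeholder
    for epoch in range(num_epochs):
        model.train()
        for batch_idx, (inputs, labels) in enumerate(dataloader):   # dataloader yields numpy batches
            loss = model(inputs, labels)    # forward + backward on the training graph
            optimizer.step()
            if batch_idx % 100 == 0:
                force_values(state)         # re-insert the forced values
            model.lazy_reset_grad()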

However, after all that explanation, my question is: why, despite my modification of the parameters, does the fully re-trained ONNX model (which reaches the same accuracy) not contain the modified values in its weights?

Is there something I'm doing wrong? Perhaps I am modifying the weights in the wrong phase?

Any help would be appreciated.

Thanks.


To reproduce

I am running an onnxruntime build from source with CUDA 11.2, GCC 9.5, CMake 3.27, and Python 3.8 on Ubuntu 20.04.

Urgency

As soon as possible

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

onnxruntime-training 1.17.0+cu112

PyTorch Version

None

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.2

baijumeswani commented 9 months ago

My goal is to end up with an ONNX model in which, after the re-training phase, some weight values are the ones I determine, while keeping the same accuracy.

Do you not want to train a part of the model parameters? In that case, you can put those parameters that you do not want to train inside frozen_params.
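For reference, generating the training artifacts with some parameters frozen looks roughly like this (the model path and parameter names are placeholders):

    import onnx
    from onnxruntime.training import artifacts

    model = onnx.load("model.onnx")
    artifacts.generate_artifacts(
        model,
        requires_grad=["conv1.weight", "conv2.weight"],   # parameters that will be trained
        frozen_params=["conv3.weight"],                    # parameters excluded from training
        loss=artifacts.LossType.CrossEntropyLoss,
        optimizer=artifacts.OptimType.AdamW,
        artifact_directory="training_artifacts",
    )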

Could you share steps to reproduce what you're seeing? I can take a look.

IzanCatalan commented 9 months ago

Hello @baijumeswani, to answer your question, I want to train specific model parameters across multiple layers.

This includes training around 20% of the tensor weights in some layers and 5% in others. Therefore, I cannot put them into frozen_params.

That is why I modify and force values, changing them once every 100 batches in each epoch via the CheckpointState parameters. This helps the training adapt to the modified values.

To clarify my objective, consider a CNN model with three convolution layers. I aim to set element x in the first layer to 25, element y in the second layer to 5, and element z in the last layer to 125.

The three elements must end the training phase with the specified values while maintaining the same accuracy as the original CNN model. To achieve this, I added all three layers to the requires_grad list in the On-Device Training functions, so they are trained and the other weight values can adjust to counteract the effect of the forced values (x, y, and z) on accuracy. A sketch of this forcing step is shown below.
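As an illustration, the forcing step for x, y, and z looks roughly like this (the parameter names and flat indices are placeholders, and the parameter accessor may differ between ORT versions):

    # Placeholder flat indices for the three elements x, y and z.
    x_index, y_index, z_index = 0, 0, 0

    forced = {
        "conv1.weight": (x_index, 25.0),
        "conv2.weight": (y_index, 5.0),
        "conv3.weight": (z_index, 125.0),
    }

    def force_elements(state):
        for name, (flat_index, value) in forced.items():
            param = state.parameters[name]
            data = param.data             # device -> host numpy copy
            data.flat[flat_index] = value
            param.data = data             # host numpy -> device copy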

That is my purpose. The results show that the parameters I updated to the specified values did not keep those values at the end of training. What I need to know is: why is that so? Is there something I am missing about how training works, or how else can I achieve the objective described above?

baijumeswani commented 9 months ago

That is my purpose. The results show that the parameters I updated to the specified values did not keep those values at the end of training. What I need to know is: why is that so? Is there something I am missing about how training works, or how else can I achieve the objective described above?

From my understanding, your purpose is as follows:

Example weight tensors on each layer:
Layer 1: tensor_weight_1: Tensor[100]
Layer 2: tensor_weight_2: Tensor[50]
Layer 3: tensor_weight_3: Tensor[10]

You want to train a part of Layer 1 (let's say 20% of it). For example tensor_weight_1[:10] and tensor_weight_1[20:30] should be trainable and the rest should not.

Similarly, on Layer 2, perhaps 10% of the weight tensor is trainable. For example tensor_weight_2[:5] is trainable and the rest is not.

Please let me know if I correctly described your scenario. From your description, I also gathered that when you set a parameter value and try to retrieve it immediately without calling optimizer step on it, the parameter value remains the same as what you initially set it to.

I don't have a good answer as to why the training didn't maintain the parameter values after the training was complete. Perhaps the model finds that fine-tuning those elements helps improve it.

But a good solution might be to set the gradients to 0 for those elements of the weight tensor that you want to keep fixed/frozen before calling the optimizer step. We don't have a way of doing that right now in ONNX Runtime. If you like, you could pick up this work for ONNX Runtime; I can help you with the development. Conceptually, the masking would look like the sketch below.
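For illustration only (this is not an existing ONNX Runtime API): in PyTorch, the equivalent per-element gradient masking can be done with a hook that zeroes the selected gradient entries before the optimizer step; the layer and indices below are arbitrary.

    import torch

    layer = torch.nn.Conv2d(3, 16, kernel_size=3)

    # Zero-out mask over the elements whose values should stay fixed.
    mask = torch.ones_like(layer.weight)
    mask.view(-1)[[0, 7, 42]] = 0.0

    # The hook multiplies the gradient by the mask on every backward pass,
    # so optimizer.step() leaves the masked elements (almost) untouched.
    layer.weight.register_hook(lambda grad: grad * mask)

Note that with decoupled weight decay (e.g. AdamW), the masked elements would still be shrunk slightly by the decay term, since that term does not pass through the gradient.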

IzanCatalan commented 9 months ago

@baijumeswani Yes, your understanding is correct. A doubt occurred to me that may be related: when I modify the parameters, I do so via the CheckpointState parameters. Since I am working with CUDA, the training runs on the GPU, and I set up the model's Module in the following way:

    model = orttraining.Module(
        "docker/training_artifactsFULL/training_model.onnx",  # training graph
        checkpoint_state,                                      # CheckpointState holding the parameters
        "docker/training_artifactsFULL/eval_model2.onnx",      # eval graph
        "cuda",                                                # device on which training runs
    )

Therefore, both the data and optimizer.step() are processed on the GPU. Could this affect the updated parameters? My reasoning is that if I modify the parameters through CheckpointState on the CPU, I am unsure whether those parameters are also updated on the GPU.

I am unsure about the inner workings of ONNX Runtime and whether these parameters are copied to the GPU. If the parameters are only updated on the CPU while the actual copy lives on the GPU, that could explain why the final value differs from the modified one, even though the value looks correct when I access it during training.

Let me know what you think.

baijumeswani commented 9 months ago

Therefore, both the data and optimizer.step() are processed on the GPU. Could this affect the updated parameters? My reasoning is that if I modify the parameters through CheckpointState on the CPU, I am unsure whether those parameters are also updated on the GPU.

I am unsure about the inner workings of ONNX Runtime and whether these parameters are copied to the GPU. If the parameters are only updated on the CPU while the actual copy lives on the GPU, that could explain why the final value differs from the modified one, even though the value looks correct when I access it during training.

This is not the case. When you call the getter on a parameter, the underlying C++ code copies the data from the device where the parameter resides (in your case CUDA) to the CPU and returns the parameter as a numpy array in Python.

Similarly, when you call the setter, the underlying C++ code copies the raw data of the given numpy array from the CPU to the device where the parameter resides (in your case CUDA).

You can find the code here (for the getter) and here (for the setter).
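As a rough illustration (the checkpoint path and parameter name are placeholders, and the exact accessor may vary between ORT versions), the copy semantics look like this:

    from onnxruntime.training.api import CheckpointState

    state = CheckpointState.load_checkpoint("training_artifacts/checkpoint")
    param = state.parameters["conv1.weight"]

    w = param.data       # getter: copies the tensor from the device (CUDA) to a host numpy array
    w.flat[0] = 25.0     # editing the numpy copy alone does not touch device memory
    param.data = w       # setter: copies the numpy array back to the device copy of the parameter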

IzanCatalan commented 9 months ago

@baijumeswani Yes, that is correct. I ran a debug training execution, printing a tensor value from the moment it is modified in the first batch, through every following batch, until it is modified again at batch 100 and in the next epoch. I have reached several conclusions that may be interesting, and I would like to discuss them with you:

1) When a value is assigned, it is kept immediately after the assignment but is already slightly changed in the next batch. For example, if I assign the value 25, right after the assignment the value is 25, but in the next batch it has drifted by thousandths (24.999327). The training keeps applying these minor updates in every batch until batch 100, where I set the value again (24.996214 back to 25). This is the expected training behaviour and, as you said, it will keep happening unless the gradients of those weights are set to 0. My only remaining doubt is whether modifying the values every 100 batches is a good approach, because, as I checked, the modified values only drift by decimals. Perhaps modifying them once at the first epoch, once in the middle epoch, and once at the end would show the same behaviour.

2) Related to point 1: I noticed that the weight update does NOT happen when the forced value is a large positive number. For example, if I set a value to 2.58e+8 (the exponent needs to be +4 or higher for this to happen), the value is not updated at all in the following batches, not even by a single decimal. I wonder whether this behaviour is expected or intentionally programmed in ORT (see the numerical sketch at the end of this comment). Here is the log of the first 20 batches and the modification of 5 values:


----------------- batch 0 modification ---------------------------------------------
update value  1-> Before:  -0.037884194 | After: 258000000.0
update value  2-> Before: -0.03131715 | After:258000000.0
update value  3-> Before: -0.021781376 | After: 258000000.0
update value  4-> Before: 0.0057122777 | After: 258000000.0
update value  5-> Before: -0.010252925 | After: 258000000.0
-----------------end batch 0 modification ---------------------------------------------
Batch   1 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   2 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   3 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   4 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   5 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   6 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   7 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   8 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   9 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   10 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   11 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   12 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   13 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   14 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   15 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   16 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   17 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   18 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   19 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0
Batch   20 258000000.0 258000000.0 258000000.0 258000000.0 258000000.0

3) There is another behaviour I observed that I believe is some internal procedure of ORT. When I modify a tensor value to a number with a negative exponent beyond e-3 (e.g. 2.22e-4 or -2.67e-8), the values are modified, but in the very next batch (once lazy_reset_grad and then an optimizer step have run), the printed value is different, as if it had been clipped or similar. For instance, if I repeat the example from point 2) but set the 5 values to 2.58e-8 instead of 2.58e+8, I get a completely different result: the numbers jump to around 0.00093 right after the next optimizer step. The same happens if the exponent or the number changes; for instance, with 3.54e-21 I get the same outputs:

----------------- batch 0 modification ---------------------------------------------
update value  1-> Before: -0.037884485 | After: 2.58e-08
update value  2-> Before: -0.031317357 | After: 2.58e-08
update value  3-> Before: -0.02178156 | After: 2.58e-08
update value  4-> Before: 0.005712017 | After: 2.58e-08
update value  5-> Before: -0.010253467 | After: 2.58e-08
-----------------end batch 0 modification ---------------------------------------------
Batch 1 0.0009361423 0.000996921 0.000982629 0.0009952293 0.0008674057
Batch 2 0.0015061507 0.0015384969 0.0014174902 0.0014691905 0.0013345057
Batch 3 0.0016142927 0.0015493307 0.0011915138 0.0012106828 0.0013763082
Batch 4 0.0020594415 0.002003469 0.0015627207 0.0015972403 0.0018372384
Batch 5 0.0021611708 0.0021422761 0.0016400404 0.0017286006 0.0020645838
Batch 6 0.00250552 0.0025339758 0.0019215712 0.0020034679 0.0023976522
Batch 7 0.0028143644 0.0028893726 0.0022213915 0.0022978196 0.0027447566
Batch 8 0.0030807946 0.0031971931 0.0024978882 0.0025674033 0.0030366327
Batch 9 0.0033630736 0.0035220943 0.002794941 0.0028516322 0.003338706
Batch 10 0.0036043492 0.0037909804 0.0030318222 0.003069865 0.0035745974
Batch 11 0.0038263227 0.0040414403 0.0032599159 0.0032845358 0.0037975768
Batch 12 0.004024314 0.004267491 0.0034648122 0.0034766842 0.003997858
Batch 13 0.0043891985 0.004659093 0.0038425846 0.0038288196 0.0043487013
Batch 14 0.0047034775 0.0049992045 0.004170831 0.0041346042 0.0046551344
Batch 15 0.004988883 0.0053084446 0.0044691465 0.004412239 0.004933438
Batch 16 0.0052593662 0.0056017535 0.0047517535 0.00467334 0.0051930416
Batch 17 0.00550428 0.0058670994 0.005007386 0.0049094143 0.005427841
Batch 18 0.0057322085 0.0061139045 0.005245985 0.0051298304 0.0056467797
Batch 19 0.0059441444 0.0063437056 0.00546871 0.005334968 0.0058491537
Batch 20 0.006133957 0.00655045 0.0056693996 0.0055196458 0.006031317

I would like to understand why these 3 points happen, or whether this is normal training behaviour and I am missing something.
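As a numerical aside (a possible factor, not a confirmed explanation), single-precision floating point could account for part of points 2 and 3: at 2.58e+8 the spacing between adjacent float32 values is 16, so updates on the order of 1e-3 round away entirely, while at 25 they are representable. By the same token, a starting value of 2.58e-8 is negligible next to per-step updates of roughly 1e-3, which would be consistent with the jump to about 0.00093 in point 3.

    import numpy as np

    # Spacing between adjacent float32 values at the two magnitudes discussed above.
    print(np.spacing(np.float32(25.0)))      # ~1.9e-06 -> updates of ~1e-3 are representable
    print(np.spacing(np.float32(2.58e8)))    # 16.0     -> updates of ~1e-3 round away entirely

    # A typical small optimizer update changes 25.0 but not 2.58e+8.
    print(np.float32(25.0) - np.float32(0.000673))     # 24.999327
    print(np.float32(2.58e8) - np.float32(0.000673))   # 258000000.0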

IzanCatalan commented 9 months ago

@baijumeswani, any updates on my last post? Were you able to check it?

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.