lu-group / sbinn

SBINN: Systems-biology informed neural network
Apache License 2.0

Result doesn't seem right. #4

Closed RubADuckDuck closed 2 years ago

RubADuckDuck commented 2 years ago

I have trained the model on the PyTorch backend for 1,000,000 epochs and plotted 'glucose.dat' and 'test.dat' together. Other than the observed 'G', the results for the other variables don't look right.

Have I made a mistake in scaling the values, or is this the expected result?

[figure: training loss curve]

[figure: comparison of 'glucose.dat' and 'test.dat']

mitchelldaneker commented 2 years ago

We only have 1 observable, G. This means the network will only be looking at G when it is training and estimating parameters. Since the network cannot observe the other 5 state variables, its prediction of those state variables will be very poor as seen in your figure. That is why we have the ODE model. You will get much better information on the other state variables when you solve the ODE model with the inferred parameters.

Note that in the practical identifiability analysis section of the paper, you will find that one of the parameters is unable to be inferred. This is due to that parameter having no effect on G - thus making it unidentifiable when you only have G to estimate parameters. While this will mean using the ODE model to solve for the other 5 state variables will have some error, it will be a much better result than the network alone.

RubADuckDuck commented 2 years ago

Thank you for your detailed answer! It helped a lot!!

ZSTanone commented 2 years ago

Thank you for your detailed answer! It helped a lot! Do you have any comments on the output transform mentioned in the SBINN paper? I don't quite understand it. Thanks!

mitchelldaneker commented 2 years ago

There is a description in the paper, but essentially an output transform is applied to the output of the network. For a simple description, imagine we have two outputs, A and B. We can use the output transform to do a few things; two of the main uses are scaling and applying hard constraints.

For the scaling, imagine B/1000 ~ O(A). If this is the case, the network may struggle to provide both outputs due to the order-of-magnitude difference. As a way around this, we can scale the variables so that the network outputs are the same order of magnitude. To solve the issue in this simple case, we can look at the orders of magnitude and realize that if we multiply B by 1000 in the output transform, both will have the same order of magnitude within the network. This means the network is actually predicting B/1000, and multiplying by 1000 recovers B. In the paper, we use the mean of the data as the scale factor.
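As a minimal sketch of the scaling idea in plain NumPy (not the actual SBINN code; the variable names A and B are from the toy example above), the transform takes the raw network outputs and rescales each column back to physical units. In DeepXDE, a function of this shape can be registered on the network with `apply_output_transform`.

```python
import numpy as np

def output_transform(t, y):
    """Rescale raw network outputs so each state variable is O(1) inside
    the network. Column 0 is A (already O(1)); column 1 is the network's
    estimate of B/1000, so multiplying by 1000 recovers B in physical
    units. (The paper uses the mean of the data as the scale factor.)"""
    A = y[:, 0:1]           # left unchanged
    B = y[:, 1:2] * 1000.0  # network predicts B/1000; rescale to B
    return np.concatenate([A, B], axis=1)

# Toy check: a raw output of 0.7 in the second column maps to B = 700.0,
# while the network only ever had to represent an O(1) number.
raw = np.array([[0.5, 0.7]])
scaled = output_transform(np.array([[0.0]]), raw)
```

The point of the design is that the optimizer sees two outputs of comparable magnitude, which tends to balance the gradients between them.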

As for hard constraints, these are useful for applying initial or boundary conditions. Say that as an IC, at t = 0, B = 0 and A = 1. We could apply soft constraints via dde.IC. Hard constraints would instead multiply A and B by functions that force them to always satisfy those initial conditions. For B, we may multiply by tanh(t), which is zero at t = 0. For A, we may also multiply by tanh(t), but to supply the IC we add exp(t). So the expression would be (A*tanh(t) + exp(t)), which satisfies the initial condition since tanh(0) = 0 and exp(0) = 1. The exact equations you use depend on the IC/BC and the system.
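The hard-constraint transform above can be sketched the same way (again plain NumPy, not the SBINN code; A and B are the hypothetical outputs from the example). Whatever the raw network outputs are, the transformed outputs satisfy the ICs exactly at t = 0.

```python
import numpy as np

def hard_constraint(t, y):
    """Enforce the ICs A(0) = 1 and B(0) = 0 as hard constraints.
    tanh(t) vanishes at t = 0, so the raw network output cannot violate
    the IC, and exp(t) supplies A's initial value of 1 at t = 0."""
    A_net, B_net = y[:, 0:1], y[:, 1:2]
    A = A_net * np.tanh(t) + np.exp(t)  # A(0) = exp(0) = 1
    B = B_net * np.tanh(t)              # B(0) = tanh(0) = 0
    return np.concatenate([A, B], axis=1)

t0 = np.array([[0.0]])
y_raw = np.array([[3.2, -1.7]])  # arbitrary raw network outputs
out = hard_constraint(t0, y_raw)  # ICs hold regardless of y_raw
```

Note that any smooth function equal to 1 at t = 0 would work for A's additive term; exp(t) is just one choice, and for long time horizons a bounded alternative may be preferable.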

chenyv118 commented 1 year ago

> There is a description in the paper, but essentially an output transform is applied to the output of the network. For a simple description, imagine we have two outputs, A and B. We can use the output transform to do a few things; two of the main uses are scaling and applying hard constraints. [...] The exact equations you use depend on the IC/BC and the system.

I found that convergence generally becomes very slow after using the same method to implement hard constraints. Is this a defect of hard constraints, or could the weighting of the loss terms or some other factor have an effect on it?


It converges in about 100,000 to 200,000 iterations when I use soft constraints.

HGangloff commented 11 months ago

Hi @mitchelldaneker, it looks like I get the same results as the OP when running the PyTorch script; only G, for which we have observations, seems well estimated by the SBINN after training. To be clear, should we expect the other outputs to be estimated as accurately as G? Or are the OP's plots the final result, demonstrating the usefulness of the ODE model?

mitchelldaneker commented 11 months ago

> Hi @mitchelldaneker, it looks like I get the same results as the OP when running the PyTorch script; only G, for which we have observations, seems well estimated by the SBINN after training. To be clear, should we expect the other outputs to be estimated as accurately as G? Or are the OP's plots the final result, demonstrating the usefulness of the ODE model?

Yes, you can only "trust" G in this case, and even that is a loose trust. Remember that G has data, so you can compare against that data and gauge trustworthiness in that sense. We have found that, with inverse PINNs generally, the parameters are learned long before the state variables are, hence the "loose trust". You may need to train 4-5x as long to get good results on the other state variables. In this case, since we have a standard, fast method for solving the ODE model, it is better to plug the inferred parameters into that solver and use its predictions.
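The "plug the inferred parameters into a standard solver" step can be sketched with SciPy. This is a hypothetical two-state stand-in, not the actual six-state glucose model; the parameter names k1 and k2 and the right-hand side are placeholders for illustration only.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Placeholder values standing in for parameters inferred by the SBINN.
k1, k2 = 0.8, 0.3

def rhs(t, y):
    """Toy two-state ODE: once the parameters are known, a standard
    solver recovers every state variable, including the unobserved ones."""
    g, i = y
    return [-k1 * g + i, k2 * g - i]

# LSODA switches between stiff/non-stiff methods automatically.
sol = solve_ivp(rhs, (0.0, 10.0), [1.0, 0.0], method="LSODA",
                dense_output=True)
final_states = sol.y[:, -1]  # all states at t = 10, not just the observed one
```

This is why the network's poor estimates of the unobserved states are not a problem in practice: the ODE solve is cheap once the parameters are in hand.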

mitchelldaneker commented 11 months ago

> There is a description in the paper, but essentially an output transform is applied to the output of the network. [...] The exact equations you use depend on the IC/BC and the system.

> I found that convergence generally becomes very slow after using the same method to implement hard constraints. Is this a defect of hard constraints, or could the weighting of the loss terms or some other factor have an effect on it? It converges in about 100,000 to 200,000 iterations when I use soft constraints.

Sorry for the late reply @chenyv118. By using hard constraints, you are changing the function from the very start. This can have a strong effect on the loss landscape, and thus may require a change in your loss weights. Generally, initialization methods produce network outputs near zero. If you use hard constraints, especially with linear scaling and addition like we do here, this can affect the loss and may require slightly different weights.