Transfer Learning - Githubissues

lululxvi / deepxde

A library for scientific machine learning and physics-informed learning

https://deepxde.readthedocs.io

GNU Lesser General Public License v2.1

2.64k stars 739 forks source link

Transfer Learning #188

Closed mariusmerkle closed 3 years ago

mariusmerkle commented 3 years ago

I would like to study transfer learning with PINNs in-depth and use the DeepXDE library for that. I thought of the following process:

Train a source model (model_source) on some differential equation, e.g. Burger's equation with nu = 0.01/pi
Save the trained source model using "loss_history, train_state = model_source.train(model_save_path = "Source_Model")"
Restore the model using "model_target.train(epochs = 10000, model_restore_path = "Source_Model-" + str(train_state.best_step))" and train the new target model (model_target) initialised with the trained source model on Burger's equation with similar nu, e.g. 0.02/pi

Unfortunately, I get an error, when restoring the model: "NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

2 root error(s) found. (0) Not found: Key beta1_power_21 not found in checkpoint [[node save_5/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/deepxde/model.py:205) ]] [[save_5/RestoreV2/_1279]] (1) Not found: Key beta1_power_21 not found in checkpoint [[node save_5/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/deepxde/model.py:205) ]] 0 successful operations."

I use Google Colab and you can access my notebook via https://colab.research.google.com/drive/1uGWGSIvMLOQ3--nO6i_eSSGukbFypvu5?usp=sharing where the error occurs in the second last cell. I have read across all previous threads of storing and restoring, but checkpoints were employed every time and I would like to use the more simple functionality of storing and restoring, where training is continued from a trained model. Do you have any ideas on how to fix the issue?

lululxvi commented 3 years ago

You may try to split the code of source model and the code of the target model into two separate files, and run the code one by one.

mariusmerkle commented 3 years ago

Great, that worked for me - I can now save and reload a model for further training. I have split the notebooks into

Source Model (https://colab.research.google.com/drive/1uGWGSIvMLOQ3--nO6i_eSSGukbFypvu5?usp=sharing)
Target Model (https://colab.research.google.com/drive/17miTTHzQ9jN8p2_mz7fwNa_H25eAQCBy?usp=sharing)

To measure the effectiveness of transfer learning, I would like to train two models (one with random initialization = untrained source model, one with non-random initialization = trained source model) and compare the iteration at which the train-/test-loss reaches a certain threshold, e.g. L_2 <= 10^(-5). How can I do that - I couldn't find any proper functionality in the documentation?

Besides, I don't understand the train and test losses: The three columns should correspond to PDE residual, boundary condition residual, and initial condition residual. But which column is representing which loss? And why does the test loss only have one column?

lululxvi commented 3 years ago

You can use the train loss for your purpose. The loss will be saved into loss.dat after the training.
The first column is PDE loss, and the remaining columns have the same order when you define the IC and BC.
The test loss only has the PDE loss, because usually the IC/BC can be trained well.

mariusmerkle commented 3 years ago

Thank you for your advice, @lululxvi . I am currently studying the the effect of transfer learning on the heat equation on different dimensions: 1D (space), 2D/3D (space and time).

In the 3D case, I have trouble to optimise the network. I use a hard constraint to satisfy boundary conditions exactly. For an initial condition $u(x, y, t = 0) = cos(0.5*pi*x)*cos(0.5*pi*x)$ and zero Dirichlet BC on all 4 sides x = ±1 and y = ±1, I use the output transformation

net.apply_output_transform(lambda x, u: u * x[:, 2:3]*(1 - x[:, 0:1]**2)*(1 - x[:, 1:2]**2) + tf.cos(0.5*np.pi*x[:, 0:1])*tf.cos(0.5*np.pi*x[:, 1:2]))

for which the distance function (1 - x[:, 0:1]**2)*(1 - x[:, 1:2]**2) is zero at the spatial boundary and increases smoothly towards the center of the spatial domain.

The PDE residual shows large errors at the edges and corners:

I already varied the number of epochs, network architecture, number of collocation points and can't avoid the large PDE residuals. Equally, for

net.apply_output_transform(lambda x, u: u * x[:, 2:3]*tf.cos(0.5*np.pi*x[:, 0:1])*tf.cos(0.5*np.pi*x[:, 1:2]) + tf.cos(0.5*np.pi*x[:, 0:1])*tf.cos(0.5*np.pi*x[:, 1:2]))

the PDE residual stays large. Residual-based adaptive refinement doesn't solve the issue, either. Do you have any ideas how to fix that issue? See my Google Colab notebook: https://colab.research.google.com/drive/1i_SbbFYwTpxSHhhA7vB9rxkyufXb42FN?usp=sharing

I would appreciate, if someone who has already employed boundary conditions as hard constraints, like @HHHMX, could share his experience as well.

lululxvi commented 3 years ago

How does the algorithm work for 1D and 2D case? Do you verify that the network satisfy the BCs?

mariusmerkle commented 3 years ago

Of course, I enforce the BCs by a hard constraint, in the 2D case (1D space, 1D time):

net.apply_output_transform(lambda x, u: u * x[:, 1:2]*(1 - x[:, 0:1]**2) - tf.sin(np.pi*x[:, 0:1]))

and in the 3D case (2D space, 1D time): net.apply_output_transform(lambda x, u: u * x[:, 2:3]*(1 - x[:, 0:1]**2)*(1 - x[:, 1:2]**2) + tf.cos(0.5*np.pi*x[:, 0:1])*tf.cos(0.5*np.pi*x[:, 1:2]))

to enforce zero Dirichlet boundary conditions and at x = ±1 and y±1 as well as the initial condition T(x, y, t = 0) = cos(pi/2x)*cos(pi/2y).

I suppose that in the 3D case (2D space, 1D time),

x[:,0:1] refers to the first spatial dimension
x[:,1:2] refers to the second spatial dimension
x[:,2:3] refers to the temporal dimension

Is that correct? @lululxvi

Anyway, I still didn't succeed in solving the time-dependent 2D space problem which you can access here:

https://colab.research.google.com/drive/1i_SbbFYwTpxSHhhA7vB9rxkyufXb42FN?usp=sharing

Are the following hyperparameters appropriate for the time-dependent 2D problem?

10.000 domain points
3 layers
20 hidden units
10.000 Adam iterations with learning rate 1e-3, then L-BFFS optimiser
tanh non-linear activation function

lululxvi commented 3 years ago

Yes, it is correct. If you can solve 1D space case, then I believe your code should be OK. You might tune carefully by checking is training loss too large? or training loss is small, but testing loss is large? The hyperparameters changes case by case. But 10000 Adam iterations seems too small, at least try 100000 for 3D case.

mariusmerkle commented 3 years ago

After 100.000 iterations, the model is definitely overfitting (train loss 7.72e-3, test loss 1.89e+00, l2 relative error 0.295).

To avoid overfitting, I tried to use more domain points, 100.000, and trained again for 100.000 iterations. Unfortunately, no improvements again: train loss 1.98e-02, test loss 7.72e+00, l2 relative error 0.276).

How can I avoid overfitting? See the code below @lululxvi

`import numpy as np import tensorflow as tf

import deepxde as dde import matplotlib.pyplot as plt

nu_ref = 0.1

x_min = 0 x_max = 1 y_min = 0 y_max = 1 t_min = 0 t_max = 1

def pde(x, u): du = tf.gradients(u, x)[0] ddu = tf.gradients(du, x)[0] u_t = du[:, 2:3] u_xx = ddu[:, 0:1] u_yy = ddu[:, 1:2] return u_t - nu_ref*(u_xx + u_yy)

def func(x): return np.sin(np.pix[:, 0:1])np.sin(np.pix[:, 1:2])np.exp(-2nu_refnp.pi*2x[:, 2:3])

geom = dde.geometry.Rectangle(xmin = [x_min, y_min], xmax = [x_max, y_max]) timedomain = dde.geometry.TimeDomain(t_min, t_max) geomtime = dde.geometry.GeometryXTime(geom, timedomain)

data = dde.data.TimePDE(geomtime, pde, [], num_domain=10000, solution=func, num_test=10000)

layer_size = [3] + [20] 3 + [1] activation = "tanh" initializer = "Glorot uniform" net = dde.maps.FNN(layer_size, activation, initializer) net.apply_output_transform(lambda x, u: u x[:, 2:3] x[:, 0:1] (1 - x[:, 0:1]) x[:, 1:2] (1 - x[:, 1:2]) + tf.sin(np.pix[:, 0:1])tf.sin(np.pi*x[:, 1:2]))

model = dde.Model(data, net)

model.compile("adam", lr=0.001, metrics=["l2 relative error"]) losshistory, train_state = model.train(epochs = 100000)

model.compile("L-BFGS-B") losshistory, train_state = model.train()

dde.saveplot(losshistory, train_state, issave = True, isplot = True)

resolution = 100 x = np.linspace(x_min, x_max, resolution) y = np.linspace(y_min, y_max, resolution) t = np.linspace(t_min, t_max, resolution) xx, yy, tt = np.meshgrid(x, y, t, indexing = 'ij') Z = np.vstack((np.ravel(xx), np.ravel(yy), np.ravel(tt))).T

y_predicted = model.predict(Z).reshape(resolution, resolution, resolution) f = model.predict(Z, operator=pde).reshape(resolution, resolution, resolution)

y_true = np.sin(np.pixx)np.sin(np.piyy)np.exp(-2nu_refnp.pi*2tt) print("Mean residual:", np.mean(np.absolute(f))) print("Relative L2 difference:", dde.metrics.l2_relative_error(y_true, y_predicted))

time_step = 60 ax = plt.axes(projection="3d") ax.plot_wireframe(xx[:, :, time_step], yy[:, :, time_step], abs(y_predicted[:, :, time_step] - y_true[:, :, time_step]))

ax.view_init(30, 45)`

mariusmerkle commented 3 years ago

Could you clarify what you mean by tuning? I varied several hyperparameters (number of hidden layers, number of hidden units as well as domain points) but didn't achieve a good relative L2 loss with any configuration, see below:

engsbk commented 3 years ago

I'm not sure what might be the problem, but can you try:

Adding more layers: net = dde.maps.FNN([3] + 1*[20] + [1], "tanh", "Glorot normal") # Glorot normal = Xavier initialiser maybe 3 or 4 instead of 1..
From my trials, it's better to run the code on smaller intervals, but you can extend your time interval to 1 at least instead of 0.1 temporal_domain = dde.geometry.TimeDomain(0, 0.1)
I created this code snippet to slice through time and space to check the difference between the exact and predicted solution for the 1D case you can change it for 2D and use it or simply use it for you 1D examples:

data_file = 'The_file_name_of_your_exact_solution'

data = np.load('dataset/'+data_file+'.npz')
t, x, exact = data["t_input"], data["x_input"], data["u_output"].T
xx, tt = np.meshgrid(x, t)
X = np.vstack((np.ravel(xx), np.ravel(tt))).T
y = exact.flatten()[:, None]

print("exact",exact.shape)
print("u_pred",u_pred.shape)

X_star = np.hstack((xx.flatten()[:,None], tt.flatten()[:,None]))
U_pred = griddata(X_star, u_pred.flatten(), (xx, tt), method='cubic')

######################################Create figure for subplots########################################

fig = plt.figure(figsize=(500, 250), dpi=80)
fig.subplots_adjust(left=0.1, bottom=0.5, right=0.125, top=0.6,
                wspace=0.3, hspace=None)

###################################### time slices' values ###############################################
num_slices = 4     # without first and last slice
slices = [[0, 0]]  #first slice

for k in range(1,num_slices):
    slices += [[int(k*len(t)/num_slices), int(k*len(x)/num_slices)]]

slices += [[int(len(t)-1),int(len(x)-1)]] #last slice
################################################ Slices #################################################

for t_slice,x_slice in slices:
    fig = plt.figure(figsize=(500, 150), dpi=80)
    fig.subplots_adjust(left=0.1, bottom=0.5, right=0.125, top=0.6,
                wspace=0.3, hspace=None)
    #########In case scaling is required #######
    scale = 1  #can be 100, 1000.....
    ###################################
    ax = fig.add_subplot(len(slices),2, 1)
    ax.set_xlabel('x',fontsize = 15)
    ax.set_ylabel('u(t,x)')
    ax.plot(x,exact[t_slice,:],'b-')
    ax.plot(x,scale * U_pred[t_slice,:],'r--')
    ax.set_xlim(x[0], x[len(x)-1])
    ax.set_ylim(exact.min()-1, exact.max()+1)
    ax.set_title('$t = %1.3f$ s'%(t[t_slice]), fontsize = 15)

    ax = fig.add_subplot(len(slices),2, 2 )
    ax.set_xlabel('t',fontsize = 15)
    ax.set_ylabel('u(t,x)',fontsize = 15)
    ax.plot(t,exact[:,x_slice],'b-', label = 'Exact')
    ax.plot(t,scale * U_pred[:,x_slice],'r--', label = 'Prediction')
    ax.set_xlim(t[0], t[len(t)-1])
    ax.set_ylim(exact.min()-1, exact.max()+1)

    ax.set_title('$x = %1.3f$ '%(x[x_slice]), fontsize = 15)

    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.35), ncol=5, frameon=False)
    plt.show()

I'm still working on refining my 2D wave code, I will be happy to share it once I'm done.

mariusmerkle commented 3 years ago

Hi @engsbk,

Thank you for your recommendations!

I used many possible hyperparameter configurations, from 1 to 8 hidden layers, 10 to 40 hidden units, different activation functions, number of domain points 1000 to 50000, etc. So I ran the code with 3/4 hidden layers as well and didn't achieve good accuracy.

As a check, I implemented the same time-dependent 2D heat diffusion equation in SciANN, another deep learning library for scientific computations with PINNs, and achieve very good accuracy. https://colab.research.google.com/drive/10PIdjUzUJoqjK-Yc3I8fm7jStvL1721f?usp=sharing

After 1000 epochs and simple network architectures, for example 2 hidden layers with 20 units each and 1000 domain points, the relative L2 difference is 0.01. With different hyper parameter configurations, it is possible to bring the loss to the order of 10^-4. Unfortunately, the best possible relative L2 difference in DeepXDE was 0.3. In case anyone finds out the problem, I would be glad as I really enjoy working with DeepXDE.

lululxvi commented 3 years ago

@mariusmerkle You have a mistake in the PDE definition. After correcting that, the L2 relative error is 5e-5

import matplotlib.pyplot as plt
import numpy as np

import deepxde as dde
from deepxde.backend import tf

nu_ref = 0.1

x_min = 0
x_max = 1
y_min = 0
y_max = 1
t_min = 0
t_max = 1

def pde(x, u):
    u_t = dde.grad.jacobian(u, x, j=2)
    u_xx = dde.grad.hessian(u, x, i=0, j=0)
    u_yy = dde.grad.hessian(u, x, i=1, j=1)
    return u_t - nu_ref * (u_xx + u_yy)

def func(x):
    return (
        np.sin(np.pi * x[:, 0:1])
        * np.sin(np.pi * x[:, 1:2])
        * np.exp(-2 * nu_ref * np.pi ** 2 * x[:, 2:3])
    )

geom = dde.geometry.Rectangle(xmin=[x_min, y_min], xmax=[x_max, y_max])
timedomain = dde.geometry.TimeDomain(t_min, t_max)
geomtime = dde.geometry.GeometryXTime(geom, timedomain)

data = dde.data.TimePDE(
    geomtime, pde, [], num_domain=200000, solution=func, num_test=1000000
)

layer_size = [3] + [20] * 3 + [1]
activation = "tanh"
initializer = "Glorot uniform"
net = dde.maps.FNN(layer_size, activation, initializer)
net.apply_output_transform(
    lambda x, u: u
    * x[:, 2:3]
    * x[:, 0:1]
    * (1 - x[:, 0:1])
    * x[:, 1:2]
    * (1 - x[:, 1:2])
    + tf.sin(np.pi * x[:, 0:1]) * tf.sin(np.pi * x[:, 1:2])
)

model = dde.Model(data, net)

model.compile("adam", lr=0.001, metrics=["l2 relative error"])
losshistory, train_state = model.train(epochs=100000)
model.compile("L-BFGS-B")
losshistory, train_state = model.train()

dde.saveplot(losshistory, train_state, issave=True, isplot=True)

mariusmerkle commented 3 years ago

Thanks a lot! @lululxvi I achieve excellent accuracy as well, perfect. Now I understand the problem: I used

def pde_source(x, u): u_xx = dde.grad.hessian(u, x, i = 0, j = 0) u_yy = dde.grad.hessian(u, x, i = 0, j = 1) u_t = dde.grad.jacobian(u, x, i = 2) return u_t - nu*(u_xx + u_yy)

when u_yy was actually (du/dxdy) so rather u_xy). Due to this wrong Hessian, everything was wrong.

So in dde.grad.hessian, i refers to the index of x with respect to which u is differentiated first. j refers to the index of x with respect to which u is differentiated then, right? @lululxvi

lululxvi commented 3 years ago

Yes, Hessian matrix H is H[i][j] = d^2y / dx_i dx_j.

engsbk commented 2 years ago

Thanks a lot! @lululxvi I achieve excellent accuracy as well, perfect. Now I understand the problem: I used

def pde_source(x, u): u_xx = dde.grad.hessian(u, x, i = 0, j = 0) u_yy = dde.grad.hessian(u, x, i = 0, j = 1) u_t = dde.grad.jacobian(u, x, i = 2) return u_t - nu*(u_xx + u_yy)

when u_yy was actually (du/dxdy) so rather u_xy). Due to this wrong Hessian, everything was wrong.

So in dde.grad.hessian, i refers to the index of x with respect to which u is differentiated first. j refers to the index of x with respect to which u is differentiated then, right? @lululxvi

I'm glad you managed to solve this @mariusmerkle ! I'm trying to do something similar to the wave equation with a source term. Can you share your code about how did you train for (1D space, 1D time) and then transfer this learning model to (2D space, 1D time)?

mariusmerkle commented 2 years ago

Hi @engsbk, I'm not sure we are talking about the same idea here. I never transferred a model between two different spatio-temporal domains. I always keep spacial and temporal dimensions across transfer problems.

engsbk commented 2 years ago

Thank you for your reply, Marius! Let me know if you have any suggestion for my problem of the the 2D + 1D wave equation.

On Mar 27, 2022, at 6:43 PM, Marius Merkle @.**@.>> wrote:

Hi @engsbkhttps://github.com/engsbk, I'm not sure we are talking about the same idea here. I never transferred a model between two different spatio-temporal domains. I always keep spacial and temporal dimensions across transfer problems.

— Reply to this email directly, view it on GitHubhttps://github.com/lululxvi/deepxde/issues/188#issuecomment-1080034852, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AGBAUNVN7XPNFP6B43HT5Y3VCDQC3ANCNFSM4VJZ6QTQ. You are receiving this because you were mentioned.Message ID: @.***>

FZUcipher commented 1 year ago

I'm not sure what might be the problem, but can you try:

Adding more layers: net = dde.maps.FNN([3] + 1*[20] + [1], "tanh", "Glorot normal") # Glorot normal = Xavier initialiser maybe 3 or 4 instead of 1..
From my trials, it's better to run the code on smaller intervals, but you can extend your time interval to 1 at least instead of 0.1 temporal_domain = dde.geometry.TimeDomain(0, 0.1)
I created this code snippet to slice through time and space to check the difference between the exact and predicted solution for the 1D case you can change it for 2D and use it or simply use it for you 1D examples:

data_file = 'The_file_name_of_your_exact_solution'

data = np.load('dataset/'+data_file+'.npz')
t, x, exact = data["t_input"], data["x_input"], data["u_output"].T
xx, tt = np.meshgrid(x, t)
X = np.vstack((np.ravel(xx), np.ravel(tt))).T
y = exact.flatten()[:, None]

print("exact",exact.shape)
print("u_pred",u_pred.shape)

X_star = np.hstack((xx.flatten()[:,None], tt.flatten()[:,None]))
U_pred = griddata(X_star, u_pred.flatten(), (xx, tt), method='cubic')

######################################Create figure for subplots########################################

fig = plt.figure(figsize=(500, 250), dpi=80)
fig.subplots_adjust(left=0.1, bottom=0.5, right=0.125, top=0.6,
                wspace=0.3, hspace=None)

###################################### time slices' values ###############################################
num_slices = 4     # without first and last slice
slices = [[0, 0]]  #first slice

for k in range(1,num_slices):
    slices += [[int(k*len(t)/num_slices), int(k*len(x)/num_slices)]]

slices += [[int(len(t)-1),int(len(x)-1)]] #last slice
################################################ Slices #################################################

for t_slice,x_slice in slices:
    fig = plt.figure(figsize=(500, 150), dpi=80)
    fig.subplots_adjust(left=0.1, bottom=0.5, right=0.125, top=0.6,
                wspace=0.3, hspace=None)
    #########In case scaling is required #######
    scale = 1  #can be 100, 1000.....
    ###################################
    ax = fig.add_subplot(len(slices),2, 1)
    ax.set_xlabel('x',fontsize = 15)
    ax.set_ylabel('u(t,x)')
    ax.plot(x,exact[t_slice,:],'b-')
    ax.plot(x,scale * U_pred[t_slice,:],'r--')
    ax.set_xlim(x[0], x[len(x)-1])
    ax.set_ylim(exact.min()-1, exact.max()+1)
    ax.set_title('$t = %1.3f$ s'%(t[t_slice]), fontsize = 15)

    ax = fig.add_subplot(len(slices),2, 2 )
    ax.set_xlabel('t',fontsize = 15)
    ax.set_ylabel('u(t,x)',fontsize = 15)
    ax.plot(t,exact[:,x_slice],'b-', label = 'Exact')
    ax.plot(t,scale * U_pred[:,x_slice],'r--', label = 'Prediction')
    ax.set_xlim(t[0], t[len(t)-1])
    ax.set_ylim(exact.min()-1, exact.max()+1)

    ax.set_title('$x = %1.3f$ '%(x[x_slice]), fontsize = 15)

    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.35), ncol=5, frameon=False)
    plt.show()

I'm still working on refining my 2D wave code, I will be happy to share it once I'm done.

May I ask if you have shared your code perfectly, where can I see it?