lululxvi / deepxde

A library for scientific machine learning and physics-informed learning
https://deepxde.readthedocs.io
GNU Lesser General Public License v2.1

batch normalization: loss increased when deepxde.maps.map.Map.training set to False #69

Closed smao-astro closed 3 years ago

smao-astro commented 4 years ago

Hi Lu,

I am trying deepxde.maps.fnn.FNN.batch_normalization at https://github.com/lululxvi/deepxde/blob/8e811adfca060766dcbbaeec59c30300be134d00/deepxde/maps/fnn.py#L27 and I noticed that when https://github.com/lululxvi/deepxde/blob/8e811adfca060766dcbbaeec59c30300be134d00/deepxde/model.py#L253 is called, the loss increases significantly compared to the loss computed at https://github.com/lululxvi/deepxde/blob/8e811adfca060766dcbbaeec59c30300be134d00/deepxde/model.py#L243. To reproduce, here are two scripts to compare against each other:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np

import deepxde as dde
from deepxde.backend import tf

def gen_testdata():
    data = np.load("dataset/Burgers.npz")
    t, x, exact = data["t"], data["x"], data["usol"].T
    xx, tt = np.meshgrid(x, t)
    X = np.vstack((np.ravel(xx), np.ravel(tt))).T
    y = exact.flatten()[:, None]
    return X, y

def main():
    def pde(x, y):
        dy_x = tf.gradients(y, x)[0]
        dy_x, dy_t = dy_x[:, 0:1], dy_x[:, 1:2]
        dy_xx = tf.gradients(dy_x, x)[0][:, 0:1]
        return dy_t + y * dy_x - 0.01 / np.pi * dy_xx

    geom = dde.geometry.Interval(-1, 1)
    timedomain = dde.geometry.TimeDomain(0, 0.99)
    geomtime = dde.geometry.GeometryXTime(geom, timedomain)

    bc = dde.DirichletBC(
        geomtime, lambda x: np.zeros((len(x), 1)), lambda _, on_boundary: on_boundary
    )
    # lambda _, on_boundary: on_boundary
    #   an anonymous function: it takes two inputs and, no matter what the first
    #   input is, returns the second input unchanged
    ic = dde.IC(
        geomtime, lambda x: -np.sin(np.pi * x[:, 0:1]), lambda _, on_initial: on_initial
    )

    data = dde.data.TimePDE(
        geomtime, pde, [bc, ic], num_domain=2540, num_boundary=80, num_initial=160
    )
    net = dde.maps.FNN([2] + [20] * 3 + [1], "tanh", "Glorot normal")
    model = dde.Model(data, net)

    # We first use Adam for a certain number of iterations, and then switch to L-BFGS.
    model.compile("adam", lr=1e-3)
    model.train(epochs=1500)
    # The L-BFGS optimizer does not require a learning rate,
    # and the neural network is trained until convergence,
    # so the number of iterations is also ignored for L-BFGS.
    model.compile("L-BFGS-B")
    losshistory, train_state = model.train(display_every=200)
    # dde.saveplot(losshistory, train_state, issave=True, isplot=True)
    #
    # X, y_true = gen_testdata()
    # y_pred = model.predict(X)
    # f = model.predict(X, operator=pde)
    # print("Mean residual:", np.mean(np.absolute(f)))
    # print("L2 relative error:", dde.metrics.l2_relative_error(y_true, y_pred))
    # np.savetxt("test.dat", np.hstack((X, y_true, y_pred)))

if __name__ == "__main__":
    main()

Without batch_normalization, it gives:

Using TensorFlow 1 backend.

Compiling model... Building feed-forward neural network... 'build' took 0.305456 s

'compile' took 0.732757 s

Initializing variables... Training model...

Step Train loss Test loss Test metric
0 [4.72e-03, 8.26e-03, 4.37e-01] [4.72e-03, 0.00e+00, 0.00e+00] []
1000 [3.72e-02, 2.52e-04, 5.29e-02] [3.72e-02, 0.00e+00, 0.00e+00] []
1500 [3.41e-02, 1.06e-04, 4.87e-02] [3.41e-02, 0.00e+00, 0.00e+00] []

Best model at step 1500: train loss: 8.29e-02 test loss: 3.41e-02 test metric: []

'train' took 6.344179 s

Compiling model... 'compile' took 1.180172 s

Training model...

Step Train loss Test loss Test metric
1500 [3.41e-02, 1.06e-04, 4.87e-02] [3.41e-02, 0.00e+00, 0.00e+00] []
1600 [2.18e-02, 2.19e-04, 2.46e-02]
1800 [8.66e-03, 1.41e-05, 9.62e-03]
2000 [6.09e-03, 4.42e-06, 3.73e-03]
2200 [3.99e-03, 6.30e-06, 3.37e-03]
2400 [2.94e-03, 9.06e-06, 3.06e-03]
2600 [2.20e-03, 8.50e-06, 2.66e-03]
2800 [1.56e-03, 2.16e-06, 2.32e-03]
3000 [1.29e-03, 1.07e-06, 2.16e-03]
3200 [1.08e-03, 2.86e-06, 2.11e-03]
3400 [9.43e-04, 9.06e-07, 2.08e-03]
3600 [8.71e-04, 2.73e-06, 2.01e-03]
3800 [8.47e-04, 1.08e-06, 1.68e-03]
4000 [4.21e-04, 9.31e-07, 4.50e-04]
4200 [3.30e-04, 6.53e-07, 4.29e-04]
4400 [2.89e-04, 8.89e-07, 4.05e-04]
4600 [2.69e-04, 5.40e-07, 3.80e-04]
4800 [2.55e-04, 2.23e-07, 3.47e-04]
5000 [2.21e-04, 4.14e-07, 3.26e-04]
5200 [2.22e-04, 4.91e-07, 2.95e-04]
5400 [2.27e-04, 1.82e-07, 2.45e-04]
5600 [2.25e-04, 2.67e-07, 1.60e-04]
5800 [1.89e-04, 1.21e-07, 8.47e-05]
6000 [1.45e-04, 1.27e-07, 6.40e-05]
6200 [1.28e-04, 2.80e-07, 5.85e-05]
6400 [1.10e-04, 9.07e-08, 4.82e-05]
6600 [9.93e-05, 9.26e-08, 4.33e-05]
6800 [8.91e-05, 8.64e-08, 3.82e-05]
7000 [8.05e-05, 8.08e-08, 3.42e-05]
7200 [7.32e-05, 9.41e-08, 3.23e-05]
7400 [6.76e-05, 1.27e-07, 2.91e-05]
7600 [6.43e-05, 9.91e-08, 2.38e-05]
7800 [5.84e-05, 8.04e-08, 2.04e-05]
8000 [5.45e-05, 1.05e-07, 1.71e-05]
8200 [5.21e-05, 1.18e-07, 1.51e-05]
8400 [4.81e-05, 1.78e-07, 1.42e-05]
8600 [4.60e-05, 1.55e-07, 1.17e-05]
8800 [4.29e-05, 2.24e-07, 9.72e-06]
9000 [3.96e-05, 1.27e-07, 7.74e-06]
9200 [3.77e-05, 1.05e-07, 6.13e-06]
9400 [3.47e-05, 1.12e-07, 3.55e-06]
9600 [3.04e-05, 1.89e-07, 2.58e-06]
9800 [2.74e-05, 1.61e-07, 2.34e-06]
10000 [2.61e-05, 1.09e-07, 2.13e-06]
10200 [2.44e-05, 5.00e-08, 2.04e-06]
10400 [2.33e-05, 3.94e-08, 1.81e-06]
10600 [2.22e-05, 4.27e-08, 1.65e-06]
10800 [2.10e-05, 3.46e-08, 1.35e-06]
11000 [2.00e-05, 3.83e-08, 1.06e-06]
11200 [1.86e-05, 5.09e-08, 1.06e-06]
11400 [1.77e-05, 5.11e-08, 1.05e-06]
11600 [1.68e-05, 4.61e-08, 1.17e-06]
11702 [1.65e-05, 4.23e-08, 1.13e-06] [1.65e-05, 0.00e+00, 0.00e+00] []

Best model at step 11702: train loss: 1.76e-05 test loss: 1.65e-05 test metric: []

'train' took 49.402318 s

For comparison, if I add batch_normalization:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np

import deepxde as dde
from deepxde.backend import tf

def gen_testdata():
    data = np.load("dataset/Burgers.npz")
    t, x, exact = data["t"], data["x"], data["usol"].T
    xx, tt = np.meshgrid(x, t)
    X = np.vstack((np.ravel(xx), np.ravel(tt))).T
    y = exact.flatten()[:, None]
    return X, y

def main():
    def pde(x, y):
        dy_x = tf.gradients(y, x)[0]
        dy_x, dy_t = dy_x[:, 0:1], dy_x[:, 1:2]
        dy_xx = tf.gradients(dy_x, x)[0][:, 0:1]
        return dy_t + y * dy_x - 0.01 / np.pi * dy_xx

    geom = dde.geometry.Interval(-1, 1)
    timedomain = dde.geometry.TimeDomain(0, 0.99)
    geomtime = dde.geometry.GeometryXTime(geom, timedomain)

    bc = dde.DirichletBC(
        geomtime, lambda x: np.zeros((len(x), 1)), lambda _, on_boundary: on_boundary
    )
    # lambda _, on_boundary: on_boundary
    #   an anonymous function: it takes two inputs and, no matter what the first
    #   input is, returns the second input unchanged
    ic = dde.IC(
        geomtime, lambda x: -np.sin(np.pi * x[:, 0:1]), lambda _, on_initial: on_initial
    )

    data = dde.data.TimePDE(
        geomtime, pde, [bc, ic], num_domain=2540, num_boundary=80, num_initial=160
    )
    net = dde.maps.FNN([2] + [20] * 3 + [1], "tanh", "Glorot normal", batch_normalization="before")
    model = dde.Model(data, net)

    # We first use Adam for a certain number of iterations, and then switch to L-BFGS.
    model.compile("adam", lr=1e-3)
    model.train(epochs=1500)
    # The L-BFGS optimizer does not require a learning rate,
    # and the neural network is trained until convergence,
    # so the number of iterations is also ignored for L-BFGS.
    model.compile("L-BFGS-B")
    losshistory, train_state = model.train(display_every=200)
    # dde.saveplot(losshistory, train_state, issave=True, isplot=True)
    #
    # X, y_true = gen_testdata()
    # y_pred = model.predict(X)
    # f = model.predict(X, operator=pde)
    # print("Mean residual:", np.mean(np.absolute(f)))
    # print("L2 relative error:", dde.metrics.l2_relative_error(y_true, y_pred))
    # np.savetxt("test.dat", np.hstack((X, y_true, y_pred)))

if __name__ == "__main__":
    main()

the output is

Using TensorFlow 1 backend.

Compiling model... Building feed-forward neural network... 'build' took 0.459043 s

'compile' took 3.913206 s

Initializing variables... Training model...

Step Train loss Test loss Test metric
0 [3.33e-02, 2.57e-01, 2.37e-01] [3.33e-02, 0.00e+00, 0.00e+00] []
1000 [1.20e+00, 7.88e-04, 7.63e-04] [1.20e+00, 0.00e+00, 0.00e+00] []
1500 [1.11e+00, 4.78e-04, 3.57e-04] [1.11e+00, 0.00e+00, 0.00e+00] []

Best model at step 0: train loss: 5.27e-01 test loss: 3.33e-02 test metric: []

'train' took 21.294190 s

Compiling model... 'compile' took 3.631830 s

Training model...

Step Train loss Test loss Test metric
1500 [1.11e+00, 4.78e-04, 3.57e-04] [1.11e+00, 0.00e+00, 0.00e+00] []
1600 [1.81e-04, 4.00e-05, 4.24e-05]
1800 [7.96e-05, 1.73e-05, 2.90e-05]
2000 [3.92e-05, 1.16e-05, 2.35e-05]
2200 [2.34e-05, 1.08e-05, 1.87e-05]
2400 [1.92e-05, 8.32e-06, 1.26e-05]
2484 [2.58e+00, 4.39e-02, 1.84e-02] [2.58e+00, 0.00e+00, 0.00e+00] []

Best model at step 0: train loss: 5.27e-01 test loss: 3.33e-02 test metric: []

'train' took 16.359440 s

Notice that at step 2484 the loss increased by several orders of magnitude.

I am guessing that the mean and standard deviation from training are either not properly stored or not properly reused at test time. Any ideas? Thanks!
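
One way to check this guess is to inspect the batch-norm moving statistics directly. Below is a minimal sketch (not part of the original report); it assumes the TF1 backend exposes the session as model.sess and that the layers were built with tf.layers.batch_normalization, so the relevant variables have "moving_mean" / "moving_variance" in their names.

import tensorflow as tf

def print_bn_statistics(sess):
    # Collect every batch-norm moving mean / variance in the current graph.
    bn_vars = [
        v for v in tf.global_variables()
        if "moving_mean" in v.name or "moving_variance" in v.name
    ]
    for v in bn_vars:
        print(v.name, sess.run(v))

# Call once before and once after model.train(); if the values never move away
# from their initialization (0 for moving_mean, 1 for moving_variance), the
# update ops that refresh them were never executed.
# print_bn_statistics(model.sess)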

smao-astro commented 4 years ago

By the way, #68 does use batch_normalization, so its output might be influenced by the issue mentioned here.

lululxvi commented 4 years ago

The code of batch_normalization should be correct. You can try a simple function approximation for testing. For PDEs, it is a little complicated, and I am not sure whether batch normalization would help or not. It becomes even more complicated when using L-BFGS, which is a quasi-Newton method. I usually don't use batch norm for PDEs. Let me know if you have any specific reason to use batch norm.
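
For reference, a minimal function-approximation test along these lines might look like the sketch below; it assumes the dde.data.Function dataset and reuses the batch_normalization argument from the scripts above.

import numpy as np
import deepxde as dde

def func(x):
    # Target function to approximate.
    return np.sin(np.pi * x)

geom = dde.geometry.Interval(-1, 1)
data = dde.data.Function(geom, func, 64, 1000)
net = dde.maps.FNN(
    [1] + [20] * 3 + [1], "tanh", "Glorot normal", batch_normalization="before"
)
model = dde.Model(data, net)
model.compile("adam", lr=1e-3, metrics=["l2 relative error"])
losshistory, train_state = model.train(epochs=10000)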

smao-astro commented 4 years ago

Hi Lu,

I see. Thank you for your reply. I will temporarily close this issue, since I also did not spot anything wrong in the batch normalization code.

smao-astro commented 3 years ago

The issue still exists when using batch_normalization="before" together with "L-BFGS-B", and I suspect this is because update_ops is not executed when is_scipy_opts(optimizer) is True:

https://github.com/lululxvi/deepxde/blob/f8ab0b06ad1be019ec48433af0a2e9cee8e711bc/deepxde/train.py#L17-L47
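
For context, here is a standalone TF1 sketch (not DeepXDE code) of the mechanism in question: tf.layers.batch_normalization puts the ops that refresh the moving mean/variance into the UPDATE_OPS collection, a TensorFlow optimizer runs them only if they are attached to the train op, and a scipy-driven optimizer such as "L-BFGS-B" never runs them at all, so the training=False evaluation sees stale statistics.

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1])
training = tf.placeholder(tf.bool, [])
h = tf.layers.batch_normalization(tf.layers.dense(x, 8, tf.tanh), training=training)
y = tf.layers.dense(h, 1)
loss = tf.reduce_mean(tf.square(y - x))

# TensorFlow optimizer: attaching UPDATE_OPS makes every training step also
# refresh the batch-norm moving statistics.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    data = np.random.rand(128, 1).astype(np.float32)
    for _ in range(200):
        sess.run(train_op, {x: data, training: True})
    # With the dependency in place the two modes roughly agree; drive the
    # variables through scipy instead (no update_ops) and the training=False
    # loss drifts away from the training=True loss.
    print("training=True  loss:", sess.run(loss, {x: data, training: True}))
    print("training=False loss:", sess.run(loss, {x: data, training: False}))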

smao-astro commented 3 years ago

import numpy as np

import deepxde as dde
from deepxde.backend import tf

import sys

def gen_testdata():
    data = np.load("dataset/Burgers.npz")
    t, x, exact = data["t"], data["x"], data["usol"].T
    xx, tt = np.meshgrid(x, t)
    X = np.vstack((np.ravel(xx), np.ravel(tt))).T
    y = exact.flatten()[:, None]
    return X, y

def main(batch_normalization):
    def pde(x, y):
        dy_x = tf.gradients(y, x)[0]
        dy_x, dy_t = dy_x[:, 0:1], dy_x[:, 1:2]
        dy_xx = tf.gradients(dy_x, x)[0][:, 0:1]
        return dy_t + y * dy_x - 0.01 / np.pi * dy_xx

    geom = dde.geometry.Interval(-1, 1)
    timedomain = dde.geometry.TimeDomain(0, 0.99)
    geomtime = dde.geometry.GeometryXTime(geom, timedomain)

    bc = dde.DirichletBC(geomtime, lambda x: 0, lambda _, on_boundary: on_boundary)
    ic = dde.IC(
        geomtime, lambda x: -np.sin(np.pi * x[:, 0:1]), lambda _, on_initial: on_initial
    )

    data = dde.data.TimePDE(
        geomtime, pde, [bc, ic], num_domain=2540, num_boundary=80, num_initial=160
    )
    net = dde.maps.FNN(
        [2] + [20] * 3 + [1],
        "tanh",
        "Glorot normal",
        batch_normalization=batch_normalization,
    )
    model = dde.Model(data, net)

    # model.compile("adam", lr=1e-4)
    # model.train(epochs=1500)
    model.compile("L-BFGS-B")
    losshistory, train_state = model.train()
    dde.saveplot(losshistory, train_state, issave=True, isplot=True)

    X, y_true = gen_testdata()
    y_pred = model.predict(X)
    f = model.predict(X, operator=pde)
    print("Mean residual:", np.mean(np.absolute(f)))
    print("L2 relative error:", dde.metrics.l2_relative_error(y_true, y_pred))
    np.savetxt("test.dat", np.hstack((X, y_true, y_pred)))

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else None)

Running the script with "before" as the command-line argument gives the attached loss curve (loss_curve).

The issue is that the test loss is much larger than the train loss.

lululxvi commented 3 years ago

Yes, "L-BFGS-B" does not work with "batch_normalization", because "L-BFGS-B" is from scipy. But the TensorFlow optimizers should work.

smao-astro commented 3 years ago

No, it does not (correct me if I am wrong). Applying batch_normalization="before" to examples/diffusion_1d.py gives the attached loss curve (diffusion_1d_loss_curve).
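
For reference, the modification is just the extra constructor argument; a sketch, assuming the network configuration used in examples/diffusion_1d.py at the time:

net = dde.maps.FNN(
    [2] + [32] * 3 + [1], "tanh", "Glorot uniform", batch_normalization="before"
)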

lululxvi commented 3 years ago

I am not sure whether it makes sense to use batch-norm, because here we want to compute the derivatives dy/dx. My suggestion is that you may just remove batch-norm. We have worked on many different cases, and we never use batch-norm (the main purpose of batch-norm is for deep networks). There are always other ways.

smao-astro commented 3 years ago

Hi Lu,

I see. I agree with you that one should be careful when using batch normalization in such cases. Thank you for your reply!