lululxvi / deepxde

A library for scientific machine learning and physics-informed learning
https://deepxde.readthedocs.io
GNU Lesser General Public License v2.1

NaN in MfNN example #571

Open marcnunezc opened 2 years ago

marcnunezc commented 2 years ago

Hi,

When running the example that reads from a dataset (the function version works fine):

https://github.com/lululxvi/deepxde/blob/master/examples/function/mf_dataset.py

I get a NaN in one of the test loss outputs:

Using backend: tensorflow.compat.v1
2022-03-16 14:44:58.717643: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-03-16 14:44:58.717690: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From /home/marc/.local/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:111: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /home/marc/.local/lib/python3.8/site-packages/deepxde/nn/initializers.py:118: The name tf.keras.initializers.he_normal is deprecated. Please use tf.compat.v1.keras.initializers.he_normal instead.

Compiling model...
Building multifidelity neural network...
/home/marc/.local/lib/python3.8/site-packages/deepxde/nn/tensorflow_compat_v1/mfnn.py:114: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.
  return tf.layers.dense(
/home/marc/.local/lib/python3.8/site-packages/keras/legacy_tf_layers/core.py:255: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
  return layer.apply(inputs)
'build' took 0.129456 s

2022-03-16 14:45:01.859670: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: UNKNOWN ERROR (100)
2022-03-16 14:45:01.859725: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (MARC-PC): /proc/driver/nvidia/version does not exist
2022-03-16 14:45:01.859979: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
'compile' took 0.447103 s

Initializing variables...
Training model...

Step      Train loss                        Test loss                         Test metric             
0         [4.16e+01, 6.23e+01, 8.44e-01]    [nan, 2.13e+01, 8.44e-01]         [1.01e+00, 1.03e+00]    

Best model at step 0:
  train loss: 1.05e+02
  test loss: nan
  test metric: [1.01e+00, 1.03e+00]

'train' took 0.228125 s

Is there a problem with my env setup?

I am running with:

Thanks

lululxvi commented 2 years ago

This is expected: the example has no test data for the low-fidelity output, so the corresponding test loss is NaN.
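As an illustration of how a missing test set can surface as NaN, here is a minimal NumPy sketch. It assumes the low-fidelity test arrays are simply empty; that is an assumption about the mechanism, not something confirmed from the DeepXDE source. The `mse` helper and all array names are hypothetical.

```python
import warnings

import numpy as np


def mse(y_true, y_pred):
    """Mean squared error over all entries (hypothetical helper)."""
    return np.mean(np.square(y_true - y_pred))


# Assumed stand-ins: an empty low-fidelity test set and a tiny high-fidelity one.
y_lo_true = np.empty((0, 1))
y_lo_pred = np.empty((0, 1))
y_hi_true = np.array([[1.0], [2.0]])
y_hi_pred = np.array([[1.5], [2.5]])

with warnings.catch_warnings():
    # NumPy emits RuntimeWarnings when averaging an empty array; silence them here.
    warnings.simplefilter("ignore", category=RuntimeWarning)
    loss_lo = mse(y_lo_true, y_lo_pred)  # mean over zero elements is NaN

loss_hi = mse(y_hi_true, y_hi_pred)

# Summing the per-term losses propagates the NaN into the total.
total = np.sum([loss_lo, loss_hi])
print(loss_lo, loss_hi, total)
```

This matches the pattern in the log above: only the first test-loss entry is NaN, yet the summed `test loss` is reported as `nan`, while the train losses sum to a finite value.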

marcnunezc commented 2 years ago

In that case, the code runs for only one epoch and then stops. Is that expected for the example?

I see this is set here

https://github.com/lululxvi/deepxde/blob/303ae8067d86b0b38ab06dd5701e51e17f685206/deepxde/model.py#L579-L583

Is there a way to set up the model at initialization so that the run does not stop?

lululxvi commented 2 years ago

Yes, you are right. Please install the updated version v1.1.2.

If you install a version of DeepXDE later than 1.1.2, such as 1.1.3, see the release notes for how to get exactly the same behavior as before: https://github.com/lululxvi/deepxde/releases/tag/v1.1.3
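The single-step run discussed above is consistent with a stop-on-NaN check in the training loop. The sketch below assumes that the linked lines in `model.py` stop training whenever any train or test loss is NaN; this has not been verified against that commit, and `should_stop` is a hypothetical name rather than a DeepXDE function. The loss values are the ones printed at step 0 in the log.

```python
import numpy as np


def should_stop(loss_train, loss_test):
    """Return True if any train or test loss term is NaN (hypothetical check)."""
    return bool(np.isnan(loss_train).any() or np.isnan(loss_test).any())


# Loss terms reported at step 0 in the log above.
loss_train = np.array([4.16e+01, 6.23e+01, 8.44e-01])
loss_test = np.array([np.nan, 2.13e+01, 8.44e-01])

stop = should_stop(loss_train, loss_test)
print(stop)
```

Under this assumption, the NaN low-fidelity test loss alone would be enough to end the run after the first evaluation, even though every train loss is finite.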

marcnunezc commented 2 years ago

Thank you, that works.

Final question:

The figure obtained for the dataset run is:

[Figure: output_dataset]

In comparison, this is the one obtained for the function version:

[Figure: output_func]

For the upper curve, some training dots are 0. Is this correct?

lululxvi commented 2 years ago

You can ignore those points.