apply_output/input_transform exceeds GPU memory with tensorflow 2, but works with tf.compat.v1

Hello, first, thanks for your excellent work.

Running the Lotka-Volterra Demo using DDE_BACKEND=tensorflow.compat.v1 works fine. Same program fails, when setting DDE_BACKEND=tensorflow. The computation seems to stall on the first iteration, GPU memory grows slowly, resulting in an out-of-memory error after a few minutes. During that process, there is CPU load but no GPU load (according to nvtop, only GPU memory usage). The computation runs without errors when commenting the lines

net.apply_feature_transform(input_transform)
net.apply_output_transform(output_transform)

(although of course results are not meaningful.)

Versions used: deepxde 1.5.0 (from pypi) tensorflow 2.9 CUDA 11.2, cuDNN 8.4

Thank you!

======

Output with tensorflow.compat.v1:

Using backend: tensorflow.compat.v1

2022-06-10 11:55:12.186229: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.

WARNING:tensorflow:From /scratch/local/dnolte/tensorflow/venv/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /scratch/local/dnolte/tensorflow/venv/lib/python3.10/site-packages/deepxde/nn/initializers.py:118: The name tf.keras.initializers.he_normal is deprecated. Please use tf.compat.v1.keras.initializers.he_normal instead.

Compiling model...
Building feed-forward neural network...
'build' took 0.065410 s

/scratch/local/dnolte/tensorflow/venv/lib/python3.10/site-packages/skopt/sampler/sobol.py:246: UserWarning: The balance properties of Sobol' points require n to be a power of 2. 0 points have been previously generated, then: n=0+3002=3002. 
  warnings.warn("The balance properties of Sobol' points require "
/scratch/local/dnolte/tensorflow/venv/lib/python3.10/site-packages/deepxde/nn/tensorflow_compat_v1/fnn.py:103: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.
  return tf.layers.dense(
2022-06-10 11:55:14.784721: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-10 11:55:15.215133: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2022-06-10 11:55:15.215180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5824 MB memory:  -> device: 0, name: Quadro RTX 4000, pci bus id: 0000:65:00.0, compute capability: 7.5

'compile' took 1.154730 s

Initializing variables...
Training model...

2022-06-10 11:55:15.886233: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2022-06-10 11:55:16.076309: I tensorflow/compiler/xla/service/service.cc:170] XLA service 0x7f37b0008d50 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-06-10 11:55:16.076364: I tensorflow/compiler/xla/service/service.cc:178]   StreamExecutor device (0): Quadro RTX 4000, Compute Capability 7.5
2022-06-10 11:55:16.103385: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:263] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2022-06-10 11:55:18.370229: I tensorflow/stream_executor/gpu/asm_compiler.cc:323] ptxas warning : Registers are spilled to local memory in function 'input_fusion_reduce_1'

2022-06-10 11:55:18.380061: I tensorflow/compiler/jit/xla_compilation_cache.cc:478] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.

Step      Train loss              Test loss               Test metric
0         [3.09e+02, 2.91e+01]    [3.08e+02, 2.91e+01]    []  

2022-06-10 11:55:20.208391: I tensorflow/stream_executor/gpu/asm_compiler.cc:323] ptxas warning : Registers are spilled to local memory in function 'input_fusion_reduce_1'

1000      [1.81e+00, 6.31e-01]    [1.79e+00, 6.28e-01]    []  
2000      [1.93e+00, 5.25e-01]    [1.91e+00, 5.23e-01]    []  
3000      [1.68e+00, 4.59e-01]    [1.66e+00, 4.57e-01]    []  
4000      [1.47e+00, 4.57e-01]    [1.45e+00, 4.55e-01]    []  
5000      [1.27e+00, 4.53e-01]    [1.25e+00, 4.51e-01]    []  
6000      [1.08e+00, 4.46e-01]    [1.06e+00, 4.43e-01]    []  
7000      [8.91e-01, 4.28e-01]    [8.79e-01, 4.25e-01]    []  
8000      [7.18e-01, 3.95e-01]    [7.07e-01, 3.92e-01]    []  
9000      [5.46e-01, 3.33e-01]    [5.38e-01, 3.30e-01]    []  
10000     [3.82e-01, 2.72e-01]    [3.77e-01, 2.69e-01]    []  
11000     [2.29e-01, 2.28e-01]    [2.26e-01, 2.26e-01]    []  
12000     [1.35e-01, 1.80e-01]    [1.34e-01, 1.79e-01]    []  
13000     [1.57e-01, 1.60e-01]    [1.57e-01, 1.59e-01]    []  
14000     [9.35e-02, 1.12e-01]    [9.34e-02, 1.12e-01]    []  
15000     [5.74e-02, 8.38e-02]    [5.73e-02, 8.36e-02]    []  
16000     [1.08e-01, 7.89e-02]    [1.09e-01, 7.88e-02]    []  
17000     [4.50e-02, 7.58e-02]    [4.50e-02, 7.57e-02]    []  
18000     [4.19e-02, 6.13e-02]    [4.19e-02, 6.12e-02]    []  
19000     [6.88e-02, 5.39e-02]    [6.88e-02, 5.38e-02]    []  
20000     [4.00e-02, 5.23e-02]    [4.00e-02, 5.23e-02]    []  
21000     [2.69e-02, 4.24e-02]    [2.69e-02, 4.24e-02]    []  
22000     [2.12e-02, 3.23e-02]    [2.12e-02, 3.22e-02]    []  
23000     [3.12e-02, 3.43e-02]    [3.11e-02, 3.43e-02]    []  
24000     [1.35e-02, 1.92e-02]    [1.35e-02, 1.92e-02]    []  
25000     [1.68e-02, 2.61e-02]    [1.67e-02, 2.61e-02]    []  
26000     [9.19e-03, 1.05e-02]    [9.17e-03, 1.05e-02]    []  
27000     [7.65e-03, 8.81e-03]    [7.63e-03, 8.80e-03]    []  
28000     [3.68e-02, 1.56e-02]    [3.68e-02, 1.56e-02]    []  
29000     [8.25e-03, 1.05e-02]    [8.22e-03, 1.05e-02]    []  
30000     [8.49e-03, 8.45e-03]    [8.49e-03, 8.45e-03]    []  
31000     [4.84e-03, 5.18e-03]    [4.83e-03, 5.17e-03]    []  
32000     [4.47e-03, 5.79e-03]    [4.46e-03, 5.78e-03]    []  
33000     [5.37e-03, 6.29e-03]    [5.36e-03, 6.28e-03]    []  
34000     [3.78e-03, 4.33e-03]    [3.77e-03, 4.32e-03]    []  
35000     [3.47e-03, 3.94e-03]    [3.47e-03, 3.93e-03]    []  
36000     [1.36e-01, 3.00e-02]    [1.36e-01, 3.00e-02]    []  
37000     [2.61e-02, 2.30e-02]    [2.61e-02, 2.30e-02]    []  
38000     [1.70e-02, 1.43e-02]    [1.70e-02, 1.43e-02]    []  
39000     [2.46e-02, 9.53e-03]    [2.46e-02, 9.53e-03]    []  
40000     [3.44e-03, 3.40e-03]    [3.43e-03, 3.39e-03]    []  
41000     [4.86e-02, 4.04e-02]    [4.87e-02, 4.04e-02]    []  
42000     [1.36e-02, 6.29e-03]    [1.36e-02, 6.28e-03]    []  
43000     [2.03e-03, 2.01e-03]    [2.02e-03, 2.01e-03]    []  
44000     [9.10e-03, 4.28e-03]    [9.10e-03, 4.27e-03]    []  
45000     [4.18e-03, 3.27e-03]    [4.18e-03, 3.27e-03]    []  
46000     [3.36e-03, 3.02e-03]    [3.34e-03, 3.01e-03]    []  
47000     [3.53e-02, 5.22e-03]    [3.54e-02, 5.22e-03]    []  
48000     [1.52e-03, 1.65e-03]    [1.52e-03, 1.65e-03]    []  
49000     [6.96e-03, 3.91e-03]    [6.96e-03, 3.92e-03]    []  
50000     [1.97e-02, 5.57e-03]    [1.97e-02, 5.58e-03]    []  

Best model at step 48000:
  train loss: 3.17e-03
  test loss: 3.17e-03
  test metric: []

'train' took 159.127514 s

Compiling model...
'compile' took 0.493140 s

Training model...

2022-06-10 11:57:55.598997: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1750] (One-time warning): Not using XLA:CPU for cluster.

If you want XLA:CPU, do one of the following:

 - set the TF_XLA_FLAGS to include "--tf_xla_cpu_global_jit", or
 - set cpu_global_jit to true on this session's OptimizerOptions, or
 - use experimental_jit_scope, or
 - use tf.function(jit_compile=True).

To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a
proper command-line flag, not via TF_XLA_FLAGS).

Step      Train loss              Test loss               Test metric
50000     [1.97e-02, 5.57e-03]    [1.97e-02, 5.58e-03]    []  
51000     [2.79e-05, 3.94e-05]                                
52000     [9.64e-06, 1.49e-05]                                
53000     [4.94e-06, 5.72e-06]                                
INFO:tensorflow:Optimization terminated with:
  Message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
  Objective function value: 0.000009
  Number of iterations: 2988
  Number of functions evaluations: 3271
53271     [4.13e-06, 5.36e-06]    [4.12e-06, 5.31e-06]    []  

Best model at step 53271:
  train loss: 9.48e-06
  test loss: 9.43e-06
  test metric: []

'train' took 72.661294 s

Saving loss history to /home/visual/dnolte/pinn/deepxde_demos/loss.dat ...
Saving training data to /home/visual/dnolte/pinn/deepxde_demos/train.dat ...
Saving test data to /home/visual/dnolte/pinn/deepxde_demos/test.dat ...

Output with tensorflow:

Using backend: tensorflow

2022-06-10 12:40:07.777235: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
/scratch/local/dnolte/tensorflow/venv/lib/python3.10/site-packages/skopt/sampler/sobol.py:246: UserWarning: The balance properties of Sobol' points require n to be a power of 2. 0 points have been previously generated, then: n=0+3002=3002. 
  warnings.warn("The balance properties of Sobol' points require "

Compiling model...
'compile' took 0.000374 s

2022-06-10 12:40:09.945946: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-10 12:40:10.348793: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2022-06-10 12:40:10.348835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6308 MB memory:  -> device: 0, name: Quadro RTX 4000, pci bus id: 0000:65:00.0, compute capability: 7.5

Training model...

2022-06-10 12:40:11.665303: I tensorflow/compiler/xla/service/service.cc:170] XLA service 0x5611e92dec30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-06-10 12:40:11.665334: I tensorflow/compiler/xla/service/service.cc:178]   StreamExecutor device (0): Quadro RTX 4000, Compute Capability 7.5
2022-06-10 12:40:11.685712: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:263] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2022-06-10 12:40:13.832684: I tensorflow/compiler/jit/xla_compilation_cache.cc:478] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.

Step      Train loss              Test loss               Test metric
0         [2.95e+02, 1.71e+01]    [2.95e+02, 1.71e+01]    [] 
---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
File ~/pinn/deepxde_demos/lotka-volterra_fwd.py:92, in <module>
     89 model = dde.Model(data, net)
     91 model.compile("adam", lr=0.001)
---> 92 losshistory, train_state = model.train(epochs=50000)
     93 model.compile("L-BFGS")
     94 losshistory, train_state = model.train()

File /scratch/local/dnolte/tensorflow/venv/lib/python3.10/site-packages/deepxde/utils/internal.py:22, in timing.<locals>.wrapper(*args, **kwargs)
     19 @wraps(f)
     20 def wrapper(*args, **kwargs):
     21     ts = timeit.default_timer()
---> 22     result = f(*args, **kwargs)
     23     te = timeit.default_timer()
     24     print("%r took %f s\n" % (f.__name__, te - ts))

File /scratch/local/dnolte/tensorflow/venv/lib/python3.10/site-packages/deepxde/model.py:534, in Model.train(self, epochs, batch_size, display_every, disregard_previous_best, callbacks, model_restore_path, model_save_path)
    532     if epochs is None:
    533         raise ValueError("No epochs for {}.".format(self.opt_name))
--> 534     self._train_sgd(epochs, display_every)
    535 self.callbacks.on_train_end()
    537 print("")

File /scratch/local/dnolte/tensorflow/venv/lib/python3.10/site-packages/deepxde/model.py:551, in Model._train_sgd(self, epochs, display_every)
    546 self.callbacks.on_batch_begin()
    548 self.train_state.set_data_train(
    549     *self.data.train_next_batch(self.batch_size)
    550 )
--> 551 self._train_step(
    552     self.train_state.X_train,
    553     self.train_state.y_train,
    554     self.train_state.train_aux_vars,
    555 )
    557 self.train_state.epoch += 1
    558 self.train_state.step += 1

File /scratch/local/dnolte/tensorflow/venv/lib/python3.10/site-packages/deepxde/model.py:461, in Model._train_step(self, inputs, targets, auxiliary_vars)
    459     self.sess.run(self.train_step, feed_dict=feed_dict)
    460 elif backend_name == "tensorflow":
--> 461     self.train_step(inputs, targets, auxiliary_vars)
    462 elif backend_name in ["pytorch", "paddle"]:
    463     # TODO: auxiliary_vars
    464     self.train_step(inputs, targets)

File /scratch/local/dnolte/tensorflow/venv/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    151 except Exception as e:
    152   filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153   raise e.with_traceback(filtered_tb) from None
    154 finally:
    155   del filtered_tb

File /scratch/local/dnolte/tensorflow/venv/lib/python3.10/site-packages/tensorflow/python/eager/execute.py:54, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     52 try:
     53   ctx.ensure_initialized()
---> 54   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     55                                       inputs, attrs, num_outputs)
     56 except core._NotOkStatusException as e:
     57   if name is not None:

InternalError: Failed to load in-memory CUBIN: CUDA_ERROR_OUT_OF_MEMORY: out of memory [Op:__inference_train_step_2624]

lululxvi / deepxde

apply_output/input_transform exceeds GPU memory with tensorflow 2, but works with tf.compat.v1 #731