Closed by mmz33 1 month ago
What TensorFlow version?
I'm not sure this is an infinite recursion error. Maybe the graph is just really big.
The question is where those `<tf.Operation 'conv0/W_variational_noise/cond/ReadVariableOp/Switch' type=Switch>` ops are created. You can inspect that by checking/printing `op.traceback`.
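For reference, a minimal sketch (my own, not RETURNN code) of printing an op's creation traceback in TF 2.x graph mode; `tf.Operation.traceback` returns the stack frames that were recorded when the op was built:

```python
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    x = tf.constant(1.0, name="x")

op = g.get_operation_by_name("x")
# Each entry is a frame summary (filename, line number, function name, source line).
for frame in op.traceback:
    print(frame)
```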
Is this about variational noise or about weight dropout? I see you have both. Does it still occur when you only have variational noise?
(And side remark: Does it make sense to have both?)
I tested this extended test case and that works:
**Edit:** Sorry, actually, it does not. Weight dropout is not applied here at all. This check here never matches:
```python
if (
    param_dropout
    and param.dtype.is_floating
    and isinstance(param, tf.Variable)
    and param.shape.ndims >= param_dropout_min_ndim
):
```
because at that point, `param` is not a `tf.Variable` anymore...
So I will extend the check. But that's just a separate, additional bug. It still means there might be a problem with variational noise alone.
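A sketch of what such an extended check could look like, here accepting both a `tf.Variable` and a plain `tf.Tensor` (e.g. the already-wrapped variable after variational noise was applied). The helper name and exact condition are my own illustration, not RETURNN's actual code:

```python
import tensorflow as tf


def should_apply_param_dropout(param, param_dropout, param_dropout_min_ndim=2):
    """Return True if weight dropout should be applied to this parameter.

    Accepts both tf.Variable and tf.Tensor, since at this point the
    variable may already have been wrapped (e.g. by variational noise),
    in which case it is no longer a tf.Variable instance.
    """
    return bool(
        param_dropout
        and param.dtype.is_floating
        and isinstance(param, (tf.Variable, tf.Tensor))
        and param.shape.ndims >= param_dropout_min_ndim
    )
```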
> Is this about variational noise or about weight dropout?
It is about variational noise. Before, I used only weight dropout, and that worked fine.
> Does it still occur when you only have variational noise?
Yes it still occurs.
> What TensorFlow version?
I use TF 2.13
I pushed a change where I avoid the recursion and do flat construction instead, so there should never be a "maximum recursion depth exceeded" error. However, logically, nothing should be different from before; even before, with a high enough recursion limit, it should have worked. I'm not sure this really changes something for you now (except that you don't get the recursion error; instead it might just hang and slowly run OOM?). But can you just try? My hypothesis is still that the graph might simply be very big.
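The recursion-to-flat rewrite mentioned here is the standard pattern of replacing call-stack recursion with an explicit work stack. A generic sketch (not RETURNN's actual code) of walking a dependency graph iteratively, which is bounded by memory rather than by `sys.getrecursionlimit()`:

```python
def collect_deps_flat(root, get_inputs):
    """Iteratively collect all transitive dependencies of `root`.

    `get_inputs(node)` returns the direct inputs of a node. Equivalent to a
    recursive depth-first walk, but with an explicit stack, so arbitrarily
    deep graphs cannot trigger 'maximum recursion depth exceeded'.
    """
    visited = set()
    stack = [root]
    order = []
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        stack.extend(get_inputs(node))
    return order
```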
I tried with the master branch. It fails due to an op copy error:

```
....
  File "/u/zeineldeen/dev/returnn/returnn/tf/util/gradient_checkpoint.py", line 148, in <genexpr>
    line: self._inputs = tuple(_map_tensor(x) for x in op.inputs)
    locals:
      self = <not found>
      self._inputs = <not found>
      tuple = <builtin> <class 'tuple'>
      _map_tensor = <local> <function prepare_gradient_checkpointing.<locals>._map_tensor at 0x7f12907f5ab0>
      x = <local> <tf.Tensor 'conv0/W_variational_noise/cond/ReadVariableOp/Switch:1' shape=() dtype=resource>
      op = <not found>
      op.inputs = <not found>
  File "/u/zeineldeen/dev/returnn/returnn/tf/util/gradient_checkpoint.py", line 128, in prepare_gradient_checkpointing.<locals>._map_tensor
    line: x_op_copy = _copy_op(x.op)
    locals:
      x_op_copy = <not found>
      _copy_op = <local> <function prepare_gradient_checkpointing.<locals>._copy_op at 0x7f12907f5990>
      x = <local> <tf.Tensor 'conv0/W_variational_noise/cond/ReadVariableOp/Switch:1' shape=() dtype=resource>
      x.op = <local> <tf.Operation 'conv0/W_variational_noise/cond/ReadVariableOp/Switch' type=Switch>
  File "/u/zeineldeen/dev/returnn/returnn/tf/util/gradient_checkpoint.py", line 93, in prepare_gradient_checkpointing.<locals>._copy_op
    line: raise _DeepCopyError(op)
    locals:
      _DeepCopyError = <local> <class 'returnn.tf.util.gradient_checkpoint.prepare_gradient_checkpointing.<locals>._DeepCopyError'>
      op = <local> <tf.Operation 'conv0/W_variational_noise/cond/ReadVariableOp/Switch' type=Switch>
_DeepCopyError: deep copy err: name: "conv0/W_variational_noise/cond/ReadVariableOp/Switch"
```
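For context on where a `Switch` op with `dtype=resource` comes from: with TF1-style control flow, `tf.cond` routes every external tensor entering a branch, including a resource variable's handle, through a `Switch` op, which is what produces names like `cond/ReadVariableOp/Switch`. A small sketch reproducing this (my own example; the assumption is that RETURNN's variational-noise `cond` uses v1 control flow):

```python
import tensorflow as tf

# Force TF1-style control flow so that tf.cond lowers to Switch/Merge ops
# instead of a single functional `If` op.
tf.compat.v1.disable_control_flow_v2()

g = tf.Graph()
with g.as_default():
    v = tf.compat.v1.get_variable("W", shape=[2])  # resource variable in TF2
    pred = tf.compat.v1.placeholder(tf.bool, name="train_flag")
    # Reading v inside a branch routes its resource handle through a Switch op,
    # i.e. the 'cond/ReadVariableOp/Switch' pattern with dtype=resource.
    out = tf.compat.v1.cond(pred, lambda: v.read_value(), lambda: tf.zeros([2]))

switch_ops = [op for op in g.get_operations() if op.type == "Switch"]
print([op.name for op in switch_ops])
```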
Ah, I think I know the problem. Can you post the full log/error (it shouldn't be so long now)? Btw, did you use an earlier TF version before?
I pushed again another small change. Can you test?
I set `param_variational_noise` for almost all layers in the encoder. The network I am using has only 4 encoder layers, and I am getting this Python exception. Increasing the stack limit does not fix the issue, because there seems to be an infinite loop in the gradient-checkpointing logic, as you can see in the log here: https://gist.github.com/mmz33/547a099d050983ab71c8fc7d5ca87c62
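For context, raising Python's recursion limit looks like this; as noted above, it cannot help here, since a genuinely infinite loop is not a deep-but-finite recursion:

```python
import sys

# The default limit is typically 1000 frames.
print(sys.getrecursionlimit())

# Raising it only helps for deep but *finite* recursion; an actual
# infinite loop will still hang or slowly run out of memory.
sys.setrecursionlimit(100_000)
```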
Here is the last grad-checkpoint call before crashing. It says op `.../Switch_994`, so something is wrong. It looks like it is trying to apply gradient checkpointing to the Switch op, and this loops indefinitely.
This is the RETURNN network: https://gist.github.com/mmz33/840033656b97b7e6e415c9a2b46fe75a