Closed by mmz33 1 month ago
What TensorFlow version?
I'm not sure this is an infinite recursion error. Maybe the graph is just really big.
The question is where those `<tf.Operation 'conv0/W_variational_noise/cond/ReadVariableOp/Switch' type=Switch>` ops are created. You can inspect that by checking/printing `op.traceback`.
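For reference, a minimal sketch (my own, not RETURNN code) of printing an op's creation traceback in TF 2.x graph mode; `tf.Operation.traceback` returns the stack frames that were recorded when the op was built:

```python
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    x = tf.constant(1.0, name="x")

op = g.get_operation_by_name("x")
# Each entry is a frame summary (filename, line number, function name, source line).
for frame in op.traceback:
    print(frame)
```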
Is this about variational noise or about weight dropout? I see you have both. Does it still occur when you only have variational noise?
(And side remark: Does it make sense to have both?)
I tested this extended test case and that works:
**Edit:** Sorry, actually, it does not. Weight dropout is not applied here at all. This check here never matches:
```python
if (
    param_dropout
    and param.dtype.is_floating
    and isinstance(param, tf.Variable)
    and param.shape.ndims >= param_dropout_min_ndim
):
```
because at that point, `param` is not a `tf.Variable` anymore...
So I will extend the check. But that's just a separate, additional bug. It still means there might be a problem with variational noise alone.
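A sketch of what such an extended check could look like, here accepting both a `tf.Variable` and a plain `tf.Tensor` (e.g. the already-wrapped variable after variational noise was applied). The helper name and exact condition are my own illustration, not RETURNN's actual code:

```python
import tensorflow as tf


def should_apply_param_dropout(param, param_dropout, param_dropout_min_ndim=2):
    """Return True if weight dropout should be applied to this parameter.

    Accepts both tf.Variable and tf.Tensor, since at this point the
    variable may already have been wrapped (e.g. by variational noise),
    in which case it is no longer a tf.Variable instance.
    """
    return bool(
        param_dropout
        and param.dtype.is_floating
        and isinstance(param, (tf.Variable, tf.Tensor))
        and param.shape.ndims >= param_dropout_min_ndim
    )
```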
> Is this about variational noise or about weight dropout?
It is about variational noise. Before, I used only weight dropout, and that worked fine.
> Does it still occur when you only have variational noise?
Yes it still occurs.
> What TensorFlow version?
I use TF 2.13
I pushed a change where I avoid the recursion and do flat construction instead, so there should never be a "maximum recursion depth exceeded" error. However, logically, nothing should be different from before; even before, with a high enough recursion limit, it should have worked. I'm not sure this really changes something for you now (except that you don't get the recursion error; instead it might just hang and slowly run OOM?). But can you just try? My hypothesis is still that the graph might simply be very big.
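The recursion-to-flat rewrite mentioned here is the standard pattern of replacing call-stack recursion with an explicit work stack. A generic sketch (not RETURNN's actual code) of walking a dependency graph iteratively, which is bounded by memory rather than by `sys.getrecursionlimit()`:

```python
def collect_deps_flat(root, get_inputs):
    """Iteratively collect all transitive dependencies of `root`.

    `get_inputs(node)` returns the direct inputs of a node. Equivalent to a
    recursive depth-first walk, but with an explicit stack, so arbitrarily
    deep graphs cannot trigger 'maximum recursion depth exceeded'.
    """
    visited = set()
    stack = [root]
    order = []
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        stack.extend(get_inputs(node))
    return order
```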
I tried with the master branch. It fails due to an op copy error:

```
....
  File "/u/zeineldeen/dev/returnn/returnn/tf/util/gradient_checkpoint.py", line 148, in <genexpr>
    line: self._inputs = tuple(_map_tensor(x) for x in op.inputs)
    locals:
      self = <not found>
      self._inputs = <not found>
      tuple = <builtin> <class 'tuple'>
      _map_tensor = <local> <function prepare_gradient_checkpointing.<locals>._map_tensor at 0x7f12907f5ab0>
      x = <local> <tf.Tensor 'conv0/W_variational_noise/cond/ReadVariableOp/Switch:1' shape=() dtype=resource>
      op = <not found>
      op.inputs = <not found>
  File "/u/zeineldeen/dev/returnn/returnn/tf/util/gradient_checkpoint.py", line 128, in prepare_gradient_checkpointing.<locals>._map_tensor
    line: x_op_copy = _copy_op(x.op)
    locals:
      x_op_copy = <not found>
      _copy_op = <local> <function prepare_gradient_checkpointing.<locals>._copy_op at 0x7f12907f5990>
      x = <local> <tf.Tensor 'conv0/W_variational_noise/cond/ReadVariableOp/Switch:1' shape=() dtype=resource>
      x.op = <local> <tf.Operation 'conv0/W_variational_noise/cond/ReadVariableOp/Switch' type=Switch>
  File "/u/zeineldeen/dev/returnn/returnn/tf/util/gradient_checkpoint.py", line 93, in prepare_gradient_checkpointing.<locals>._copy_op
    line: raise _DeepCopyError(op)
    locals:
      _DeepCopyError = <local> <class 'returnn.tf.util.gradient_checkpoint.prepare_gradient_checkpointing.<locals>._DeepCopyError'>
      op = <local> <tf.Operation 'conv0/W_variational_noise/cond/ReadVariableOp/Switch' type=Switch>
_DeepCopyError: deep copy err: name: "conv0/W_variational_noise/cond/ReadVariableOp/Switch"
```
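For context on where a `Switch` op with `dtype=resource` comes from: with TF1-style control flow, `tf.cond` routes every external tensor entering a branch, including a resource variable's handle, through a `Switch` op, which is what produces names like `cond/ReadVariableOp/Switch`. A small sketch reproducing this (my own example; the assumption is that RETURNN's variational-noise `cond` uses v1 control flow):

```python
import tensorflow as tf

# Force TF1-style control flow so that tf.cond lowers to Switch/Merge ops
# instead of a single functional `If` op.
tf.compat.v1.disable_control_flow_v2()

g = tf.Graph()
with g.as_default():
    v = tf.compat.v1.get_variable("W", shape=[2])  # resource variable in TF2
    pred = tf.compat.v1.placeholder(tf.bool, name="train_flag")
    # Reading v inside a branch routes its resource handle through a Switch op,
    # i.e. the 'cond/ReadVariableOp/Switch' pattern with dtype=resource.
    out = tf.compat.v1.cond(pred, lambda: v.read_value(), lambda: tf.zeros([2]))

switch_ops = [op for op in g.get_operations() if op.type == "Switch"]
print([op.name for op in switch_ops])
```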
Ah, I think I know the problem. Can you post the full log/error (it shouldn't be so long now)? Btw, did you use an earlier TF version before?
I pushed again another small change. Can you test?
I set `param_variational_noise` for almost all layers in the encoder. The network I am using has only 4 encoder layers, and I am getting this Python exception. Increasing the stack limit does not fix the issue, because there seems to be an infinite loop in the gradient-checkpointing logic, as you can see in the log here: https://gist.github.com/mmz33/547a099d050983ab71c8fc7d5ca87c62
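For context, raising Python's recursion limit looks like this; as noted above, it cannot help here, since a genuinely infinite loop is not a deep-but-finite recursion:

```python
import sys

# The default limit is typically 1000 frames.
print(sys.getrecursionlimit())

# Raising it only helps for deep but *finite* recursion; an actual
# infinite loop will still hang or slowly run out of memory.
sys.setrecursionlimit(100_000)
```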
Here is the last grad-checkpoint call before crashing. It says op `.../Switch_994`, so something is wrong. It looks like it is trying to apply gradient checkpointing to the Switch op, and this loops indefinitely.
This is the RETURNN network: https://gist.github.com/mmz33/840033656b97b7e6e415c9a2b46fe75a