Open vieting opened 2 years ago
As this came up again, I had a look at the issues. They occur when running from a network dict; all runs up to that point (including the net dict creation) work.
In RandIntLayer.get_out_data_from_opts(), the dim tags inferred from shape are:
>>> dim_tags
[Dim{B}, Dim{'3*time:data'[B]}]
>>> dim_tags[1].dyn_size
<tf.Tensor 'mul_randint/mul:0' shape=(?,) dtype=int32>
However, the dyn_size is removed in get_for_batch_ctx:
>>> dim_tags[1].get_for_batch_ctx(batch, ctx).dyn_size
None
I'm not familiar enough with the details there to have a clear idea why this is the case. This subsequently leads to the error in RandIntLayer.__init__() when calling .get_dim_value() on that dim tag. @albertz do you have an idea how to fix this?
The other two failing tests are related.
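To illustrate the failure mode described above, here is a minimal sketch with a toy class. ToyDim and its fields are hypothetical and heavily simplified; this is not RETURNN's Dim implementation, and taking the maximum of dyn_size as the dim value is an assumption. It only shows why a dim tag that has lost its dyn_size (and has no static dimension) cannot produce a dim value:

```python
class ToyDim:
    """Hypothetical, heavily simplified stand-in for a dim tag (not RETURNN's Dim)."""

    def __init__(self, description, dimension=None, dyn_size=None):
        self.description = description
        self.dimension = dimension  # static size, or None for a dynamic dim
        self.dyn_size = dyn_size    # per-sequence lengths, or None if unknown

    def get_dim_value(self):
        # Prefer the static size if there is one.
        if self.dimension is not None:
            return self.dimension
        # Assumption: a dynamic dim resolves to the longest sequence length.
        if self.dyn_size is not None:
            return max(self.dyn_size)
        # Neither is available, so there is nothing to derive the value from.
        raise Exception(
            "%s: need placeholder, self.dimension or self.dyn_size for dim value"
            % self.description)


with_sizes = ToyDim("3*time:data", dyn_size=[6, 9, 3])
without_sizes = ToyDim("3*time:data")  # same tag, but the sizes were dropped

value = with_sizes.get_dim_value()
try:
    without_sizes.get_dim_value()
    error = None
except Exception as exc:
    error = str(exc)

print(value)
print(error)
```

In this toy model, the second call fails only because dyn_size is None, which matches the state seen after get_for_batch_ctx in the session above.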
I cannot see the tests anymore. Can you post the relevant exceptions here?
What are batch and ctx in your example?
Probably some complete_dyn_size is missing here.
You can currently see the tests here from the last commit in main.
>>> ctx
None
>>> batch
BatchInfo{B}
>>> vars(batch)
{'_descendants_by_beam_name': {},
'_dim': None,
'_global_beam_dims_by_beam_name': {},
'_global_descendants_by_virtual_dims': {(GlobalBatchDim{B},): BatchInfo{B}},
'_global_padded_dims_by_dim_tag': {},
'_packed_dims_by_dim_tag': {},
'base': None,
'descendants': [],
'virtual_dims': [GlobalBatchDim{B}]}
I cannot see a difference regarding these between the working _run_torch_returnn_drop_in() and the failing _run_returnn_standalone_net_dict().
Error:
ERROR: test_layers.test_randint_dynamic
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/runner/.local/lib/python3.8/site-packages/nose/case.py", line 198, in TestBase.runTest
line: self.test(*self.arg)
locals:
self = <local> test_layers.test_randint_dynamic
self.test = <local> <function test_randint_dynamic at 0x7f79b0d533a0>
self.arg = <local> ()
File "/home/runner/work/pytorch-to-returnn/pytorch-to-returnn/tests/test_layers.py", line 46, in test_randint_dynamic
line: verify_torch_and_convert_to_returnn(model_func, inputs=x, inputs_data_kwargs={
"shape": (None, n_feat), "batch_dim_axis": 0, "time_dim_axis": 1, "feature_dim_axis": 2})
locals:
verify_torch_and_convert_to_returnn = <global> <function verify_torch_and_convert_to_returnn at 0x7f7990de6430>
model_func = <local> <function test_randint_dynamic.<locals>.model_func at 0x7f7990cdf430>
inputs = <not found>
x = <local> array([[[ 0.49671414, -0.1382643 , 0.64768857, 1.5230298 ,
-0.23415338, -0.23413695, 1.5792128 ],
[ 0.7674347 , -0.46947438, 0.54256004, -0.46341768,
-0.46572974, 0.24196227, -1.9132802 ],
[-1.7249179 , -0.5622875 , -1.0128311 , 0.31424734,
-0.9080...
inputs_data_kwargs = <not found>
n_feat = <local> 7
File "/home/runner/work/pytorch-to-returnn/pytorch-to-returnn/pytorch_to_returnn/converter/converter.py", line 436, in verify_torch_and_convert_to_returnn
line: converter.run()
locals:
converter = <local> <pytorch_to_returnn.converter.converter.Converter object at 0x7f7990d39250>
converter.run = <local> <bound method Converter.run of <pytorch_to_returnn.converter.converter.Converter object at 0x7f7990d39250>>
File "/home/runner/work/pytorch-to-returnn/pytorch-to-returnn/pytorch_to_returnn/converter/converter.py", line 143, in Converter.run
line: self._run_returnn_standalone_net_dict()
locals:
self = <local> <pytorch_to_returnn.converter.converter.Converter object at 0x7f7990d39250>
self._run_returnn_standalone_net_dict = <local> <bound method Converter._run_returnn_standalone_net_dict of <pytorch_to_returnn.converter.converter.Converter object at 0x7f7990d39250>>
File "/home/runner/work/pytorch-to-returnn/pytorch-to-returnn/pytorch_to_returnn/converter/converter.py", line 353, in Converter._run_returnn_standalone_net_dict
line: network.construct_from_dict(self._returnn_net_dict)
...
File "/home/runner/.local/lib/python3.8/site-packages/returnn/tf/network.py", line 1189, in TFNetwork._create_layer
line: layer = layer_class(**layer_desc)
locals:
layer = <not found>
layer_class = <local> <class 'returnn.tf.layers.basic.RandIntLayer'>
layer_desc = <local> {'shape': (Dim{B}, Dim{'3*time:data'[B]}), 'maxval': <CastLayer 'mul_randint_Cast' out_type=Data{[], dtype='int64'}>, 'minval': 0, 'dtype': 'int64', '_network': <TFNetwork 'root' train=False>, '_name': 'mul_randint', 'sources': [<SourceLayer 'data' out_type=Data{[B,T|'time:data'[B],F|F'feature:da..., len = 10
File "/home/runner/.local/lib/python3.8/site-packages/returnn/tf/layers/basic.py", line 2735, in RandIntLayer.__init__
line: shape_ = [
d.get_for_batch_ctx(batch, self.network.control_flow_ctx).get_dim_value()
for d in self.output.dim_tags]
locals:
shape_ = <not found>
d = <not found>
d.get_for_batch_ctx = <not found>
batch = <local> BatchInfo{B}
self = <local> <RandIntLayer 'mul_randint' out_type=Data{[B,T|'3*time:data'[B]], dtype='int64'}>
self.network = <local> <TFNetwork 'root' train=False>
self.network.control_flow_ctx = <local> None
get_dim_value = <not found>
self.output = <local> Data{'mul_randint_output', [B,T|'3*time:data'[B]], dtype='int64'}
self.output.dim_tags = <local> (Dim{B}, Dim{'3*time:data'[B]})
File "/home/runner/.local/lib/python3.8/site-packages/returnn/tf/layers/basic.py", line 2736, in <listcomp>
line: d.get_for_batch_ctx(batch, self.network.control_flow_ctx).get_dim_value()
locals:
d = <local> Dim{'3*time:data'[B]}
d.get_for_batch_ctx = <local> <bound method Dim.get_for_batch_ctx of Dim{'3*time:data'[B]}>
batch = <local> BatchInfo{B}
self = <local> <RandIntLayer 'mul_randint' out_type=Data{[B,T|'3*time:data'[B]], dtype='int64'}>
self.network = <local> <TFNetwork 'root' train=False>
self.network.control_flow_ctx = <local> None
get_dim_value = <not found>
File "/home/runner/.local/lib/python3.8/site-packages/returnn/tf/util/data.py", line 1191, in Dim.get_dim_value
line: raise Exception('%s: need placeholder, self.dimension or self.dyn_size for dim value' % self)
locals:
Exception = <builtin> <class 'Exception'>
self = <local> Dim{'3*time:data'[B]}
Exception: Dim{'3*time:data'[B]}: need placeholder, self.dimension or self.dyn_size for dim value
I wonder: in get_dim_value we already call complete_dyn_size, so why is it not available? It might help to step through it in a debugger.
In the first run, dim_tags[1].batch = None, while in the second run we get dim_tags[1].batch = BatchInfo{B}. So when calling dim_tags[1].get_for_batch_ctx(batch, ctx) in the second run, batch == dim_tags[1].batch is True, which leads to different behavior in get_for_batch_ctx.
The differences in get_for_batch_ctx() are as follows.
This is executed in the second run:
315 self._validate_in_current_graph()
--> 316 self._maybe_update()
Then later, same_base.batch == batch evaluates to False in the second run because their virtual_dims are not the same.
>>> same_base.batch.virtual_dims[0].size
<tf.Tensor 'extern_data/placeholders/batch_dim:0' shape=() dtype=int32>
>>> batch.virtual_dims[0].size
<tf.Tensor 'extern_data/placeholders/batch_dim:0' shape=() dtype=int32>
>>> same_base.batch.virtual_dims[0].size == batch.virtual_dims[0].size
False
So same_base is not returned, unlike in the first run.
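One possible explanation, stated as an assumption rather than something verified here: the two size tensors have the same name but are different objects, e.g. because they were created in different graphs, and == on them compares object identity (the plain False result suggests this). The following sketch uses hypothetical ToyGraph and ToyTensor classes, which are not TensorFlow classes, to show how two objects can print identically and still compare unequal:

```python
class ToyGraph:
    """Hypothetical stand-in for a computation graph (not a TensorFlow class)."""

    def __init__(self, label):
        self.label = label


class ToyTensor:
    """Hypothetical stand-in for a graph tensor.

    No __eq__ is defined, so == falls back to object identity.
    """

    def __init__(self, graph, name):
        self.graph = graph
        self.name = name

    def __repr__(self):
        # The owning graph is not part of the repr, so tensors from
        # different graphs can look exactly the same when printed.
        return "<ToyTensor %r>" % self.name


name = "extern_data/placeholders/batch_dim:0"
first_run = ToyTensor(ToyGraph("first run"), name)
second_run = ToyTensor(ToyGraph("second run"), name)

same_repr = repr(first_run) == repr(second_run)
same_object = first_run == second_run
same_graph = first_run.graph is second_run.graph

print(same_repr, same_object, same_graph)
```

If this is what happens here, comparing the .graph attribute of the two real tensors in the debugger would confirm or rule it out.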
The difference in .batch is already present in the input shape, which comes from the network dict.
https://github.com/rwth-i6/returnn/commit/4978ecb7794400be53c32ebde0520e182c42ed27 affects the errors here, which may help to track the issue down further; see the test cases of the latest commit.
For test_randint_dynamic and test_contrastive_loss, we now get
ValueError: Tensor("mul_randint/Max:0", shape=(), dtype=int32) must be from the same graph as Tensor("extern_data/placeholders/batch_dim:0", shape=(), dtype=int32) (graphs are <tensorflow.python.framework.ops.Graph object at 0x7f5fcaaaa100> and <tensorflow.python.framework.ops.Graph object at 0x7f5fca768160>)
which is the same as observed in https://github.com/rwth-i6/returnn/issues/1224.
For test_index_merged_dim, it is
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'extern_data/placeholders/data/data_dim0_size' with dtype int32 and shape [?]
[[node extern_data/placeholders/data/data_dim0_size (defined at home/runner/.local/lib/python3.8/site-packages/returnn/tf/util/data.py:5801) ]]
There are failing test cases for the last commit on the main branch. Since the commit is only a minor change to the readme, it is very likely that recent updates of RETURNN cause the failures.
See linked tests in https://github.com/rwth-i6/pytorch-to-returnn/commit/aaa35f9ff467633ca5ab78118f68f6647b019fe3