rwth-i6 / pytorch-to-returnn

Make PyTorch code runnable within RETURNN

Failing Test Cases #125

Open vieting opened 2 years ago

vieting commented 2 years ago

There are failing test cases for the last commit on the main branch. Since the commit is only a minor change to the readme, it is very likely that recent updates of RETURNN cause the failures.

See linked tests in

vieting commented 1 year ago

As this came up again, I had a look at the issues. They occur when running from a network dict. All runs before that step (including the net dict creation) work.

In RandIntLayer.get_out_data_from_opts(), the dim tags inferred from shape are

>>> dim_tags
[Dim{B}, Dim{'3*time:data'[B]}]
>>> dim_tags[1].dyn_size
<tf.Tensor 'mul_randint/mul:0' shape=(?,) dtype=int32>

However, the dyn_size is removed in get_for_batch_ctx.

>>> dim_tags[1].get_for_batch_ctx(batch, ctx).dyn_size

I'm not familiar enough with the details there to have a clear idea why this is the case. This subsequently leads to the error in RandIntLayer.__init__() when .get_dim_value() is called on that dim tag. @albertz do you have an idea how to fix this?

The other two failing tests are related.
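
To make the failure mode described above easier to follow, here is a minimal pure-Python sketch. `ToyDim` and its fields are hypothetical stand-ins, not RETURNN's `Dim` class, and the assumption that a per-(batch, ctx) variant is created without carrying over `dyn_size` is only a guess that mirrors the observation above. The exception text is copied from the traceback.

```python
class ToyDim:
    """Hypothetical stand-in for a dim tag. This is NOT RETURNN's Dim class."""

    def __init__(self, description, dimension=None, dyn_size=None):
        self.description = description
        self.dimension = dimension  # static size, or None if dynamic
        self.dyn_size = dyn_size  # per-sequence sizes, or None if unknown
        self._variants = {}  # (batch, ctx) -> ToyDim

    def get_for_batch_ctx(self, batch, ctx):
        """Return the variant for (batch, ctx), creating it lazily.

        Assumption for illustration: a new variant starts without dyn_size.
        """
        key = (batch, ctx)
        if key not in self._variants:
            self._variants[key] = ToyDim(self.description, self.dimension)
        return self._variants[key]

    def get_dim_value(self):
        """Return the static size, or the maximum dynamic size."""
        if self.dimension is not None:
            return self.dimension
        if self.dyn_size is not None:
            return max(self.dyn_size)
        raise Exception(
            "%s: need placeholder, self.dimension or self.dyn_size for dim value"
            % self.description)


tag = ToyDim("3*time:data", dyn_size=[6, 9, 3])
print(tag.get_dim_value())  # 9, the original tag still knows its sizes

variant = tag.get_for_batch_ctx(batch="B", ctx=None)
print(variant.dyn_size)  # None, the sizes were not carried over

try:
    variant.get_dim_value()
except Exception as exc:
    error_message = str(exc)
    print(error_message)
```

In this toy model the fix would be to fill in `dyn_size` on the variant before asking for its value, which is roughly the role the thread attributes to `complete_dyn_size`.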

albertz commented 1 year ago

I cannot see the tests anymore. Can you post the relevant exceptions here?

What is batch and ctx in your example?

Probably some complete_dyn_size is missing here.

vieting commented 1 year ago

You can currently see the tests here from the last commit in main.

>>> ctx
>>> batch
>>> vars(batch)
{'_descendants_by_beam_name': {},
 '_dim': None,
 '_global_beam_dims_by_beam_name': {},
 '_global_descendants_by_virtual_dims': {(GlobalBatchDim{B},): BatchInfo{B}},
 '_global_padded_dims_by_dim_tag': {},
 '_packed_dims_by_dim_tag': {},
 'base': None,
 'descendants': [],
 'virtual_dims': [GlobalBatchDim{B}]}

I cannot see a difference regarding these between the working _run_torch_returnn_drop_in() and the failing _run_returnn_standalone_net_dict().

albertz commented 1 year ago


ERROR: test_layers.test_randint_dynamic
Traceback (most recent call last):
  File "/home/runner/.local/lib/python3.8/site-packages/nose/", line 198, in TestBase.runTest
    line: self.test(*self.arg)
      self = <local> test_layers.test_randint_dynamic
      self.test = <local> <function test_randint_dynamic at 0x7f79b0d533a0>
      self.arg = <local> ()
  File "/home/runner/work/pytorch-to-returnn/pytorch-to-returnn/tests/", line 46, in test_randint_dynamic
    line: verify_torch_and_convert_to_returnn(model_func, inputs=x, inputs_data_kwargs={
            "shape": (None, n_feat), "batch_dim_axis": 0, "time_dim_axis": 1, "feature_dim_axis": 2})
      verify_torch_and_convert_to_returnn = <global> <function verify_torch_and_convert_to_returnn at 0x7f7990de6430>
      model_func = <local> <function test_randint_dynamic.<locals>.model_func at 0x7f7990cdf430>
      inputs = <not found>
      x = <local> array([[[ 0.49671414, -0.1382643 ,  0.64768857,  1.5[23]( ,
                           -0.23415338, -0.23413695,  1.5792128 ],
                          [ 0.7674347 , -0.46947438,  0.54256004, -0.46341768,
                           -0.46572974,  0.[24](, -1.9132802 ],
                          [-1.7249179 , -0.5622875 , -1.0128311 ,  0.31424734,
      inputs_data_kwargs = <not found>
      n_feat = <local> 7
  File "/home/runner/work/pytorch-to-returnn/pytorch-to-returnn/pytorch_to_returnn/converter/", line 436, in verify_torch_and_convert_to_returnn
      converter = <local> <pytorch_to_returnn.converter.converter.Converter object at 0x7f7990d39[25](> = <local> <bound method of <pytorch_to_returnn.converter.converter.Converter object at 0x7f7990d39250>>
  File "/home/runner/work/pytorch-to-returnn/pytorch-to-returnn/pytorch_to_returnn/converter/", line 143, in
    line: self._run_returnn_standalone_net_dict()
      self = <local> <pytorch_to_returnn.converter.converter.Converter object at 0x7f7990d39250>
      self._run_returnn_standalone_net_dict = <local> <bound method Converter._run_returnn_standalone_net_dict of <pytorch_to_returnn.converter.converter.Converter object at 0x7f7990d39250>>
  File "/home/runner/work/pytorch-to-returnn/pytorch-to-returnn/pytorch_to_returnn/converter/", line 353, in Converter._run_returnn_standalone_net_dict
    line: network.construct_from_dict(self._returnn_net_dict)
  File "/home/runner/.local/lib/python3.8/site-packages/returnn/tf/", line 1189, in TFNetwork._create_layer
    line: layer = layer_class(**layer_desc)
      layer = <not found>
      layer_class = <local> <class ''>
      layer_desc = <local> {'shape': (Dim{B}, Dim{'3*time:data'[B]}), 'maxval': <CastLayer 'mul_randint_Cast' out_type=Data{[], dtype='int64'}>, 'minval': 0, 'dtype': 'int64', '_network': <TFNetwork 'root' train=False>, '_name': 'mul_randint', 'sources': [<SourceLayer 'data' out_type=Data{[B,T|'time:data'[B],F|F'feature:da..., len = 10
  File "/home/runner/.local/lib/python3.8/site-packages/returnn/tf/layers/", line [27](, in RandIntLayer.__init__
    line: shape_ = [
            for d in self.output.dim_tags]
      shape_ = <not found>
      d = <not found>
      d.get_for_batch_ctx = <not found>
      batch = <local> BatchInfo{B}
      self = <local> <RandIntLayer 'mul_randint' out_type=Data{[B,T|'3*time:data'[B]], dtype='int64'}> = <local> <TFNetwork 'root' train=False> = <local> None
      get_dim_value = <not found>
      self.output = <local> Data{'mul_randint_output', [B,T|'3*time:data'[B]], dtype='int64'}
      self.output.dim_tags = <local> (Dim{B}, Dim{'3*time:data'[B]})
  File "/home/runner/.local/lib/python3.8/site-packages/returnn/tf/layers/", line 2736, in <listcomp>
    line: d.get_for_batch_ctx(batch,
      d = <local> Dim{'3*time:data'[B]}
      d.get_for_batch_ctx = <local> <bound method Dim.get_for_batch_ctx of Dim{'3*time:data'[B]}>
      batch = <local> BatchInfo{B}
      self = <local> <RandIntLayer 'mul_randint' out_type=Data{[B,T|'3*time:data'[B]], dtype='int64'}> = <local> <TFNetwork 'root' train=False> = <local> None
      get_dim_value = <not found>
  File "/home/runner/.local/lib/python3.8/site-packages/returnn/tf/util/", line 1191, in Dim.get_dim_value
    line: raise Exception('%s: need placeholder, self.dimension or self.dyn_size for dim value' % self)
      Exception = <builtin> <class 'Exception'>
      self = <local> Dim{'3*time:data'[B]}
Exception: Dim{'3*time:data'[B]}: need placeholder, self.dimension or self.dyn_size for dim value

albertz commented 1 year ago

I wonder, in get_dim_value we already call complete_dyn_size, so why is it not available? It would maybe be helpful to debug-step through it.

vieting commented 1 year ago

In the first run, dim_tags[1].batch = None, while in the second run we get dim_tags[1].batch = BatchInfo{B}. So when calling dim_tags[1].get_for_batch_ctx(batch, ctx) in the second run, batch == dim_tags[1].batch is True, which leads to different behavior in get_for_batch_ctx.

The differences in get_for_batch_ctx() are:

This is executed in the second run:

    315       self._validate_in_current_graph()
--> 316       self._maybe_update()

Later, same_base.batch == batch evaluates to False in the second run because their virtual_dims are not the same.

>>> same_base.batch.virtual_dims[0].size
<tf.Tensor 'extern_data/placeholders/batch_dim:0' shape=() dtype=int32>
>>> batch.virtual_dims[0].size
<tf.Tensor 'extern_data/placeholders/batch_dim:0' shape=() dtype=int32>
>>> same_base.batch.virtual_dims[0].size == batch.virtual_dims[0].size

so same_base is not returned, unlike in the first run.

The difference in .batch is already present in the input shape which comes from the network dict.
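
As a rough illustration of how two batch objects can compare unequal although their sizes print identically, here is a pure-Python sketch. `ToySize` and `ToyBatchInfo` are hypothetical names, not RETURNN or TensorFlow classes, and the sizes are stored directly in `virtual_dims` for brevity. The sketch assumes the size objects compare by identity, as graph tensors with the same name in different graphs would. Whether this is what actually happens here is not established by the output above.

```python
class ToySize:
    """Hypothetical stand-in for a graph tensor.

    The repr shows only the name, and equality is the default object identity.
    """

    def __init__(self, name, graph_id):
        self.name = name
        self.graph_id = graph_id  # which (toy) graph the object belongs to

    def __repr__(self):
        return "<ToySize %r>" % self.name


class ToyBatchInfo:
    """Hypothetical stand-in for a batch info object with virtual dims."""

    def __init__(self, virtual_dims):
        self.virtual_dims = list(virtual_dims)

    def __eq__(self, other):
        # Element-wise comparison, so it inherits identity semantics
        # from ToySize.
        return (
            isinstance(other, ToyBatchInfo)
            and self.virtual_dims == other.virtual_dims)


name = "extern_data/placeholders/batch_dim:0"
first = ToyBatchInfo([ToySize(name, graph_id=1)])
second = ToyBatchInfo([ToySize(name, graph_id=2)])

# The printed representations are indistinguishable ...
same_repr = repr(first.virtual_dims[0]) == repr(second.virtual_dims[0])
print(same_repr)  # True

# ... but the objects are distinct, so the batch objects differ.
print(first == second)  # False
print(first == first)  # True
```

When debugging a case like this, comparing with `is`, or printing `id(...)` of both sizes, shows more than the printed tensor name does.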

vieting commented 1 year ago

affects the errors here, which potentially helps to track the issue down further; see the test cases of the latest commit here.

For test_randint_dynamic and test_contrastive_loss, we now get

ValueError: Tensor("mul_randint/Max:0", shape=(), dtype=int32) must be from the same graph as Tensor("extern_data/placeholders/batch_dim:0", shape=(), dtype=int32) (graphs are <tensorflow.python.framework.ops.Graph object at 0x7f5fcaaaa100> and <tensorflow.python.framework.ops.Graph object at 0x7f5fca768160>)

which is the same as observed in
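
To narrow down which objects belong to which graph, a small helper along the following lines might be useful. This is only a sketch: `group_by_graph` is a hypothetical name, and it assumes that each inspected object exposes a `.graph` attribute (as graph-mode `tf.Tensor` objects do). The usage below runs on plain stand-in objects rather than real tensors.

```python
from types import SimpleNamespace


def group_by_graph(**named_objects):
    """Group the given names by the identity of each object's `.graph`.

    Objects without a `.graph` attribute all end up in one group.
    Returns a list of name lists, in order of first appearance.
    """
    groups = {}
    for name, obj in named_objects.items():
        graph = getattr(obj, "graph", None)
        groups.setdefault(id(graph), []).append(name)
    return list(groups.values())


# Stand-ins for two separate graphs and the tensors living in them.
graph_a, graph_b = object(), object()
dyn_size_max = SimpleNamespace(graph=graph_a)
batch_dim = SimpleNamespace(graph=graph_b)
data_size = SimpleNamespace(graph=graph_b)

result = group_by_graph(
    dyn_size_max=dyn_size_max, batch_dim=batch_dim, data_size=data_size)
print(result)  # [['dyn_size_max'], ['batch_dim', 'data_size']]
```

More than one group in the result would mean that some object was created in a different graph than the rest, which is the situation the ValueError above complains about.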

For test_index_merged_dim, it is

tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'extern_data/placeholders/data/data_dim0_size' with dtype int32 and shape [?]
     [[node extern_data/placeholders/data/data_dim0_size (defined at home/runner/.local/lib/python3.8/site-packages/returnn/tf/util/ ]]