rwth-i6 / pytorch-to-returnn

Make PyTorch code runnable within RETURNN

Fix for current RETURNN version #78

Closed · vieting closed this 2 years ago

vieting commented 2 years ago

https://github.com/rwth-i6/returnn/commit/2a1b840319b59f890958ccc8cad9639e3fea5efd introduces an issue in pytorch-to-returnn for modules that use a layer with reuse_params=True.
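For context, a minimal sketch of the affected pattern, assuming reuse_params can be given the name of the layer whose variables should be shared (layer names and sizes here are made up):

```python
# Hypothetical minimal RETURNN net dict: "linear2" shares its variables with
# "linear1" via reuse_params; this is the pattern that breaks with the
# RETURNN commit above.
network = {
    "linear1": {"class": "linear", "activation": None, "n_out": 128, "from": "data"},
    "linear2": {
        "class": "linear", "activation": None, "n_out": 128, "from": "linear1",
        "reuse_params": "linear1",  # reuse the variables of linear1
    },
    "output": {"class": "softmax", "loss": "ce", "n_out": 10, "from": "linear2"},
}
```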

vieting commented 2 years ago

Furthermore, https://github.com/rwth-i6/returnn/commit/dbaa97a9d9cfd90d073d743786b6a233cdb5eeeb introduces an issue which I have not yet fixed. When unflattening the batch dim axis, all resulting axes now get a batch dim dim_tag, which was previously not the case.

vieting commented 2 years ago

> When unflattening the batch dim axis, all resulting axes now get a batch dim dim_tag, which was previously not the case.

@albertz, is this desired? I think there is not much we can do about this from the pytorch-to-returnn side.
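To make the failing pattern concrete, a minimal sketch (layer names and dims are hypothetical; the actual reproduction lives in the RETURNN issue linked below):

```python
# Hypothetical minimal net dict: a SplitDimsLayer unflattens a flattened
# batch axis (B*T,) into (B, T). With the RETURNN commit above, both
# resulting axes now carry a batch dim_tag; previously only one did.
network = {
    "unflatten": {
        "class": "split_dims", "from": "data",
        "axis": "B",        # operate on the batch axis
        "dims": (-1, 10),   # unflatten into (new batch, 10); 10 is made up
    },
    "output": {"class": "copy", "from": "unflatten"},
}
```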

albertz commented 2 years ago

I think it's easier to have a separate issue here to discuss the problem. A PR should only be about the fix.

Apart from the issue here, can you maybe also open an issue on the RETURNN side with a simple net dict which fails now but should work? Or even multiple issues if there seem to be multiple problems.

albertz commented 2 years ago

You created the issue about SplitDimsLayer (https://github.com/rwth-i6/returnn/issues/908).

But didn't you also mention a separate issue about reuse_params?

vieting commented 2 years ago

> But didn't you also mention a separate issue about reuse_params?

Yes, the change in this PR should fix that.

albertz commented 2 years ago

It looks like there is another problem with batch norm:

  File "/home/runner/.local/lib/python3.8/site-packages/returnn/tf/layers/base.py", line 1551, in LayerBase.batch_norm.<locals>._calc_batch_norm
    line: updated_sample_mean = tf_compat.v1.assign_add(sample_mean, (mean_cur_batch - sample_mean) * momentum)
    locals:
      updated_sample_mean = <not found>
      tf_compat = <global> <module 'returnn.tf.compat' from '/home/runner/.local/lib/python3.8/site-packages/returnn/tf/compat.py'>
      tf_compat.v1 = <global> <module 'tensorflow._api.v2.compat.v1' from '/home/runner/.local/lib/python3.8/site-packages/tensorflow/_api/v2/compat/v1/__init__.py'>
      tf_compat.v1.assign_add = <global> <function assign_add at 0x7f5bb980f550>
      sample_mean = <local> <tf.Variable 'BatchNorm1d/batch_norm/mean:0' shape=(1, 11, 1) dtype=float32>
      mean_cur_batch = <local> <tf.Tensor 'BatchNorm1d/batch_norm/moments/Squeeze:0' shape=(11,) dtype=float32>
      momentum = <local> 0.1
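
The locals point to a plain shape mismatch: the stored mean variable has shape (1, 11, 1), while the freshly computed batch mean is squeezed to (11,), so the subtraction broadcasts to (1, 11, 11) and assign_add rejects the delta. A minimal standalone TensorFlow sketch of that failure (shapes copied from the traceback locals, everything else illustrative):

```python
import tensorflow as tf

# Shapes taken from the traceback locals above.
sample_mean = tf.Variable(tf.zeros((1, 11, 1)), name="mean")  # running mean
mean_cur_batch = tf.zeros((11,))                              # squeezed batch mean
momentum = 0.1

# (11,) - (1, 11, 1) broadcasts to (1, 11, 11), which no longer matches the
# variable's shape (1, 11, 1), so assign_add fails with a shape error.
try:
    tf.compat.v1.assign_add(sample_mean, (mean_cur_batch - sample_mean) * momentum)
except (ValueError, tf.errors.InvalidArgumentError) as exc:
    print("assign_add failed:", exc)
```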

albertz commented 2 years ago

Merged now, although I did not really understand why this worked before but does not anymore. I did not really look into it, though.

albertz commented 2 years ago

Ah, I think the problem is that I don't put the params into the layer.params dict anymore when they are shared and originate in another layer. So yes, then this fix should be fine.

vieting commented 2 years ago

> I think the problem is that I don't put the params into the layer.params dict anymore when they are shared and originate in another layer.

Yes exactly, this is also my understanding.
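For illustration only, a sketch of what that behavior change means for code that gathers parameters per layer (the layer objects and attribute names are stand-ins, not RETURNN's actual internals):

```python
# Hypothetical sketch: with the new behavior, a variable shared via
# reuse_params is listed in layer.params only for the layer it originates
# in, so collecting params per layer yields each variable exactly once.
def gather_params(layers):
    params = {}
    for layer_name, layer in layers.items():
        for param_name, var in layer.params.items():
            # No deduplication needed: reused params are absent here for the
            # layer that merely reuses them.
            params["%s/%s" % (layer_name, param_name)] = var
    return params
```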