rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

AvgPooling not implemented for all axis shapes on CPU #829

Closed JackTemaki closed 2 years ago

JackTemaki commented 2 years ago

As this was part of a RASR job I do not have a stack trace, but I got this error when running from a compiled graph:

error calling Session::Run (target: ): Invalid argument: Default AvgPoolingOp only supports NHWC on device type CPU
         [[{{node blstms/2/pool_0/avg_pool}}]]

which means the pooling layer needs to be restricted to a specific axis layout when using average as the pooling method. On GPU everything was fine, of course.
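
For reference, a minimal sketch (plain TensorFlow 2.x, not the original RASR/RETURNN setup) that reproduces the restriction: TensorFlow's default CPU average-pooling kernel only accepts NHWC, so feeding it an NCHW tensor fails with exactly this message.

# Minimal repro sketch, assuming plain TensorFlow 2.x (not from the original setup).
import numpy as np
import tensorflow as tf

x = tf.constant(np.random.rand(1, 3, 8, 8).astype("float32"))  # NCHW: batch, channels, height, width
with tf.device("/CPU:0"):
    # Raises InvalidArgumentError: "Default AvgPoolingOp only supports NHWC on device type CPU"
    y = tf.nn.avg_pool2d(x, ksize=2, strides=2, padding="SAME", data_format="NCHW")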

albertz commented 2 years ago

How do you compile the graph? There is no real bug here, neither in TensorFlow nor in RETURNN.

When you compile the graph for GPU, it is correct to use NCHW because that is faster. When you compile the graph for CPU, it would automatically use NHWC.
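
Roughly, the decision boils down to whether a GPU is visible in the session; a small sketch (the helper name here is an assumption, RETURNN's own check is is_gpu_available_in_session()):

import tensorflow as tf

def pick_data_format():
    # Prefer NCHW only when a GPU is available; CPU average pooling requires NHWC.
    has_gpu = bool(tf.config.list_physical_devices("GPU"))
    return "NCHW" if has_gpu else "NHWC"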

JackTemaki commented 2 years ago

Oh, you are right, this is a limitation of the compile_tf_graph.py tool and not of the RETURNN graph itself. I will extend it and make a PR; there is no issue here.

JackTemaki commented 2 years ago

Okay, I am a little bit confused here: the version I had this issue with is from December 1, and there it correctly switches between NCHW and NHWC for GPU and CPU respectively. I wanted to add a flag to compile_tf_graph.py to explicitly define a target device, but I now found that the current master always produces NHWC.

albertz commented 2 years ago

What behavior version do you use?

JackTemaki commented 2 years ago

I tested this with an old config, so none.

albertz commented 2 years ago

#823, #814, #792, #789 are probably related.

albertz commented 2 years ago

Ok, before those changes, the logic to make the output of PoolLayer NCHW was:

if tf_util.is_gpu_available_in_session() and use_channel_first:
  ...

And use_channel_first=False by default.

Note that ConvLayer was a bit different (inconsistent): The output is NCHW when:

if tf_util.is_gpu_available_in_session() and (auto_use_channel_first or input_data.is_batch_feature_major):
  ...

with auto_use_channel_first=False by default.
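
For illustration, a config fragment with the pool layer option discussed here set explicitly might look like this (layer names and most other options are assumed, not taken from this thread):

network = {
    "conv_0": {"class": "conv", "from": "data", "filter_size": (3, 3),
               "padding": "same", "n_out": 32},
    "pool_0": {"class": "pool", "from": "conv_0", "mode": "avg",
               "pool_size": (2, 2), "padding": "same",
               "use_channel_first": False},  # keep NHWC so the CPU AvgPool kernel works
    "output": {"class": "softmax", "from": "pool_0", "loss": "ce"},
}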

albertz commented 2 years ago

Do you have use_channel_first=True set explicitly in the pool layer options in your config? I wonder how you could get NCHW with the older RETURNN version. (That was the case, as I understood you, right?)

JackTemaki commented 2 years ago

I did not set use_channel_first=True, but still got the NCHW order.

Ah, but the version I used has use_channel_first=True as the default. You changed this exactly on the day I updated the RETURNN version.

JackTemaki commented 2 years ago

Okay, so there is no bug here; I just updated RETURNN in the middle of your restructuring work. I will fix this manually in my Sisyphus graph, and as this is only a testing setup it will re-run with another RETURNN commit and an updated behavior version anyway.

albertz commented 2 years ago

Ah, I see. Yeah, at first I thought it might make sense to always do this. Then I reconsidered: it should maybe only be done automatically with a new behavior version, so as not to cause trouble in older setups which depend on a specific order of axes.

But in any case, what you need then is just this flag for compile_tf_graph.py.
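
A rough sketch of what such a flag could look like (the flag name and wiring are assumptions, not the actual interface of compile_tf_graph.py):

# Hypothetical sketch; the real tool's argument handling may differ.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--device", choices=["cpu", "gpu"], default="gpu",
                    help="target device for the compiled graph; "
                         "'cpu' forces NHWC, 'gpu' allows NCHW")
args = parser.parse_args()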