How do you compile the graph? There is no real bug here, neither in TensorFlow nor in RETURNN.
When you compile the graph for GPU, it is correct to use NCHW because that is faster. When you compile the graph for CPU, it would automatically use NHWC.
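As a minimal illustration (plain TensorFlow, not RETURNN code; the helper name here is made up), the layout choice boils down to something like:

```python
import tensorflow as tf

def pick_data_format(for_gpu: bool) -> str:
    # NCHW (channels-first) is generally faster with cuDNN on GPU,
    # while most TensorFlow CPU kernels only implement NHWC (channels-last).
    return "NCHW" if for_gpu else "NHWC"

# E.g. for a CPU graph the pooling op would be built with NHWC:
x = tf.random.normal([1, 10, 10, 16])  # batch, height, width, channels
y = tf.nn.max_pool2d(x, ksize=2, strides=2, padding="VALID",
                     data_format=pick_data_format(for_gpu=False))
```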
Oh, you are right, this is an issue with the limitations of the compile_tf_graph.py tool, and not with the RETURNN graph itself. I will extend this and make a PR; there is no bug here.
Okay, I am a little bit confused here; the version I had this issue with is from December 1, and there it correctly switches between NCHW and NHWC for GPU and CPU respectively. I wanted to add a flag to compile_tf_graph.py to specifically define a target device, but now I found that the current master always produces NHWC.
What behavior version do you use?
I tested this with an old config, so none.
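(For reference: in newer configs this would be pinned explicitly as a global option; a minimal sketch, with an arbitrary value:)

```python
# Excerpt of a RETURNN config (a Python file); the concrete value is arbitrary here.
use_tensorflow = True
behavior_version = 12  # old configs simply omit this
```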
Ok, before those changes, the logic to make the output of `PoolLayer` NCHW was:

```python
if tf_util.is_gpu_available_in_session() and use_channel_first:
  ...
```

And `use_channel_first=False` by default.
Note that `ConvLayer` was a bit different (inconsistent): the output is NCHW when:

```python
if tf_util.is_gpu_available_in_session() and (auto_use_channel_first or input_data.is_batch_feature_major):
  ...
```

with `auto_use_channel_first=False` by default.
Do you have `use_channel_first=True` set explicitly in your config on the pool layer options? I wonder how you could get NCHW in the older RETURNN version. (This was the case, as I understood you, right?)
I did not set `use_channel_first=True`, but still got the NCHW order.
Ah, but the version I used has `use_channel_first=True` as the default. You changed this exactly on the day when I updated the RETURNN version.
Okay, so there is no bug here; I just updated RETURNN in the middle of your restructuring work. I will fix this manually in my Sisyphus graph, and as this is only a testing setup, it will be re-run with another RETURNN commit and an updated behavior version anyway.
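Concretely, pinning the option in the network dict would look roughly like this (layer names and the other options here are only illustrative):

```python
network = {
    # ... other layers ...
    "pool1": {"class": "pool", "from": "conv1", "mode": "avg",
              "pool_size": (2, 2), "use_channel_first": False},
    # ... other layers ...
}
```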
Ah, I see. Yeah, at first I thought it might make sense to always do this. Then I reconsidered: it should maybe only be done automatically with a new behavior version, so as not to cause any trouble in older setups which depend on some specific order of axes.
But in any case, what you then need is just this flag for compile_tf_graph.py.
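I.e. an invocation roughly like `python3 tools/compile_tf_graph.py my.config --device cpu ...`, where the device flag is exactly the proposed addition, so its exact name and default are up to the PR.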
As this was part of a RASR Job I do not have a stack trace, but I got an error when running from a compiled graph, which means the pooling layer needs to be restricted to a specific axis layout when using average as the pooling method. On GPU everything was fine, of course.
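For context (this is not the original error output, which I do not have, but a minimal plain-TensorFlow sketch of the same class of failure on a standard CPU-only build):

```python
import tensorflow as tf

x = tf.random.normal([1, 16, 10, 10])  # NCHW: batch, channels, height, width
with tf.device("/CPU:0"):
    # Average pooling in NCHW typically fails on CPU with an error along the
    # lines of "Default AvgPoolingOp only supports NHWC on device type CPU",
    # while the same op runs fine on GPU.
    y = tf.nn.avg_pool2d(x, ksize=2, strides=2, padding="VALID",
                         data_format="NCHW")
```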