E.g. if you have a sparse `[B,T]` tensor, it is actually virtually `[B,T,F]`. If you then e.g. `reduce_max` over the `T` axis, I think it currently reduces over the vocab ids. Instead, it should make a dense `[B,T,F]` first, so that the result along `F` is just the max of the one-hot vectors from before.
The same probably applies in many other cases, e.g. also `DotLayer` in #741.
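
To illustrate the difference, here is a minimal NumPy sketch (not RETURNN code). The `one_hot` helper, the example ids and the vocabulary size of 5 are made up for illustration. It only shows what reducing over the raw ids gives compared to reducing over the virtual one-hot tensor.

```python
import numpy as np


def one_hot(ids, num_classes):
    """Expand integer class ids [B,T] into dense one-hot vectors [B,T,F]."""
    # Hypothetical helper: row i of the identity matrix is the one-hot vector for id i.
    return np.eye(num_classes, dtype=np.int64)[ids]


# Made-up sparse input with B=2, T=3 and a vocabulary of F=5 classes.
ids = np.array([[0, 3, 1],
                [2, 2, 4]])
num_classes = 5

# Reducing the raw ids over T only picks the largest class id per sequence.
max_over_ids = ids.max(axis=1)  # shape [B]

# Reducing the dense one-hot tensor over T marks every class that occurs
# anywhere in the sequence.
max_over_one_hot = one_hot(ids, num_classes).max(axis=1)  # shape [B,F]

print(max_over_ids)
print(max_over_one_hot)
```

The first result is just a number per sequence with no real meaning for class labels, while the second is a multi-hot vector per sequence, which is what the max over one-hot vectors would be.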

Originally posted by @Zettelkasten in https://github.com/rwth-i6/returnn/issues/741#issuecomment-963053472