@Zettelkasten @mmz33 @robin-p-schmitt maybe you have some ideas, or comments?
Note that I already realized this as a problem some time ago, but it came up again now when thinking about specifying the output spatial dim (`out_spatial_dim`, #597) and the exact behavior.
When it is not optimized (e.g. simply with `optimize_move_layers_out=False`) and you then accumulate the output of `MaskedComputationLayer`, you would get shape [T,B,D], and the shape [T',B,D] would never actually exist.
Should `out_spatial_dim` in all cases refer to T', even when T' would actually never exist? So basically `out_spatial_dim` would be ignored inside the loop, and only used outside a rec loop?
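
For concreteness, a minimal config sketch of what that could look like, assuming the `out_spatial_dim` argument proposed in #597 (the dim-tag helper and the exact `MaskedComputationLayer` options here are illustrative and may differ by RETURNN version):

```python
# Hypothetical net-dict fragment, assuming out_spatial_dim as proposed in #597.
from returnn.tf.util.data import SpatialDim

masked_time_dim = SpatialDim("masked-time")  # this would be T'

network_fragment = {
    "masked": {
        "class": "masked_computation", "from": "data", "mask": "mask_source",
        "unit": {"class": "linear", "n_out": 512, "activation": None},
        # Inside the rec loop this would effectively be ignored (T' never exists there);
        # only when optimized out of the loop would the output time dim actually be T'.
        "out_spatial_dim": masked_time_dim,
    },
}
```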
The new dim T' would somehow have a reference to T. `derived_from_tag` is obvious, but we need more, especially the mask, or the indices for unmasking.
It's a bit like when e.g. `ConvLayer` does downsampling via striding. Then T' = T / 2. And this could be formulated as a mask as well.
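
For illustration, a small NumPy sketch of that equivalence (plain NumPy, not RETURNN code):

```python
import numpy as np

T = 10
x = np.arange(T)          # stand-in for a tensor over the time dim T

strided = x[::2]          # what striding with stride 2 keeps; T' = T // 2

mask = np.zeros(T, dtype=bool)
mask[::2] = True          # the same subsampling expressed as a boolean mask over T
masked = x[mask]

assert np.array_equal(strided, masked)
```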
Now we can see the unmasking as a sparse-to-dense operation.
However, this is still ambiguous. In sparse-to-dense, we would set 0 for all other frames. But this is rarely what we would want here. Specifically, when reproducing the output of a `MaskedComputationLayer`, we would have a very specific behavior, namely that the previous masked frame would be copied.
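
A NumPy sketch of the two different unmasking semantics (illustration only, not RETURNN code; assumes the first frame is masked):

```python
import numpy as np

mask = np.array([True, False, False, True, False, True])  # [T]: which frames were computed
y_masked = np.array([10., 20., 30.])                      # [T'] with T' = mask.sum()

# Variant 1: sparse-to-dense, fill all other frames with 0.
dense_fill = np.zeros(len(mask))
dense_fill[mask] = y_masked            # -> [10, 0, 0, 20, 0, 30]

# Variant 2: "left" unmasking, copy the previous masked frame
# (this matches what MaskedComputationLayer effectively does inside a rec loop).
idx = np.cumsum(mask) - 1              # index of the most recent computed frame, per position
dense_left = y_masked[idx]             # -> [10, 10, 10, 20, 20, 30]
```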
So, when we want to have automatic unmasking of T' to T, when such tensors are combined somewhere, e.g. via `Data.get_common_data` and then `Data.copy_compatible_to`, we don't just need the mask but also the kind of unmasking, i.e. how to fill the other frames.
We could use the `derived_from_op` mechanism, via a special op kind `"mask"`, and store the mask in the attributes, along with `unmask_type="left"` or so, to reflect that we want this specific behavior when unmasking.
This could later be extended to represent sparse data via `unmask_type="fill"` and `unmask_fill_value`.
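
Purely as a hypothetical sketch of what such a `derived_from_op` entry could carry (this is not existing RETURNN API; the names are just the ones from the text above):

```python
# Hypothetical structure for a derived_from_op of kind "mask" on the T' dim tag.
# time_dim_tag and mask_data stand for the existing dim tag T and the mask Data instance.
time_dim_tag = "dim-tag-for-T"       # placeholder for the dim tag of T
mask_data = "mask-data-[B,T]-bool"   # placeholder for the Data instance holding the mask

derived_from_op = dict(
    kind="mask",
    inputs=[time_dim_tag],            # the original dim T
    attribs=dict(
        mask=mask_data,               # e.g. shape [B,T], dtype bool
        unmask_type="left",           # copy the previous masked frame when unmasking
        # unmask_type="fill", unmask_fill_value=0.0,  # possible extension for sparse data
    ),
)
```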
The mask itself would be a `Data` instance. In our case, it would be of shape [B,T] with dtype bool. However, the shape is probably arbitrary, although it should probably contain the original dim T.
Or, as mentioned, instead of a mask, we could also store the indices, i.e. a `Data` of shape [B,T'] pointing into T.
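
The two representations are straightforward to convert between; a per-sequence NumPy sketch (ignoring batching and padding, which RETURNN would handle via dynamic sequence lengths):

```python
import numpy as np

mask = np.array([True, False, True, True, False])   # [T] bool

# mask -> indices: a [T'] tensor pointing into T
indices = np.where(mask)[0]                          # -> [0, 2, 3]

# indices -> mask
mask_back = np.zeros(mask.shape, dtype=bool)
mask_back[indices] = True
assert np.array_equal(mask, mask_back)
```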
See RETURNN principles.
When defining `MaskedComputationLayer` inside a rec loop and not thinking about rec automatic optimization, the behavior is clear: it copies the prev output (and prev state) whenever the mask is False, otherwise it uses the current output. In this line of thought, there should be no need for `UnmaskLayer`.

However, the current situation is: when it is optimized out, a tensor of shape [B,D] inside does not just become [T,B,D], where T is the number of rec iterations, but instead becomes [T',B,D], where T' = sum(mask). Some follow-up operation at some point requires the [T,B,D] shape (maybe because it combines it with other tensors from the rec loop), and thus the user explicitly must use `UnmaskLayer` at some point after the `MaskedComputationLayer`. This violates the RETURNN principle that the user should not need to think about rec automatic optimization.
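
To make the two cases concrete, a small NumPy sketch (illustration only, not RETURNN internals):

```python
import numpy as np

T, B, D = 5, 1, 1
mask = np.array([True, False, True, False, True])   # [T]
x = np.random.randn(T, B, D)                         # per-frame input
f = lambda frame: frame * 2.0                        # stand-in for the masked sub-network

# Inside the rec loop (not optimized): copy the prev output whenever the mask is False.
out = np.zeros((T, B, D))
prev = np.zeros((B, D))                              # initial output
for t in range(T):
    prev = f(x[t]) if mask[t] else prev
    out[t] = prev                                    # accumulated output: shape [T,B,D]

# Optimized out of the loop: only the masked frames get computed.
out_opt = f(x[mask])                                 # shape [T',B,D] with T' = mask.sum()
# To get [T,B,D] again, the user currently has to add UnmaskLayer explicitly,
# which performs exactly the "left" unmasking sketched further above.
```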
What are possible solutions?
I don't really have a good idea so far. It should be opaque to the user.
Maybe the new dim tag for T' could have some special flag that it is a masked subset of T, and whenever it is going to be combined with the other dim tag, it would automatically do the unmasking.
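
Purely as a hypothetical sketch of that idea (none of this is existing RETURNN API):

```python
# Hypothetical: the dim tag for T' carries a flag/reference that it is a masked subset of T,
# and the combining logic (e.g. get_common_data / copy_compatible_to) uses it to unmask automatically.
class MaskedSubsetDim:
    def __init__(self, base_dim, mask_data, unmask_type="left"):
        self.base_dim = base_dim        # the original dim tag T
        self.mask_data = mask_data      # mask as Data, e.g. shape [B,T], dtype bool
        self.unmask_type = unmask_type  # "left": copy prev masked frame; "fill": constant fill

    def is_masked_subset_of(self, other_dim):
        return other_dim is self.base_dim

# The combining logic could then do something like:
#   if isinstance(dim, MaskedSubsetDim) and dim.is_masked_subset_of(target_dim):
#       tensor = unmask(tensor, dim.mask_data, kind=dim.unmask_type)
```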