rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

`MaskedComputationLayer` is violating the principle that the user should not need to think about rec automatic optimization #769

Closed albertz closed 2 years ago

albertz commented 2 years ago

See RETURNN principles.

When defining a MaskedComputationLayer inside a rec loop and not thinking about rec automatic optimization, the behavior is clear: it copies the previous output (and previous state) whenever the mask is False, and otherwise it computes and uses the current output.

In this line of thought, there should be no need for UnmaskLayer.

However, the current situation is that, when it is optimized out of the loop, a tensor of shape [B,D] inside does not simply become [T,B,D], where T is the number of rec iterations, but instead becomes [T',B,D] with T' = sum(mask). Some follow-up operation at some point requires the [T,B,D] shape (maybe because it combines the output with other tensors from the rec loop), and thus the user must explicitly use the UnmaskLayer at some point after the MaskedComputationLayer.
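For illustration, a minimal net-dict sketch of the current situation. The data key "mask", the sub-unit and all sizes are hypothetical; only the "masked_computation" and "unmask" layer classes with their "mask"/"unit" options are the ones under discussion here.

```python
# Minimal sketch, not a complete config. Assumes extern_data provides "data"
# (input frames) and a bool key "mask" over the same time axis (hypothetical names).
net_dict = {
    "loop": {
        "class": "rec", "from": "data", "unit": {
            # Per frame inside the loop: input [B,D_in], mask [B] bool.
            "masked": {
                "class": "masked_computation", "from": "data:source", "mask": "data:mask",
                # The sub-unit only runs on frames where the mask is True;
                # otherwise the previous output (and state) is kept.
                "unit": {"class": "rec", "unit": "NativeLstm2", "n_out": 64, "from": "data"},
            },
            # Needed today once the loop is optimized out: maps [T',B,D] back to [T,B,D]
            # by repeating the last masked frame. Inside the loop it is basically a no-op.
            "unmasked": {"class": "unmask", "from": "masked", "mask": "data:mask"},
            "output": {"class": "copy", "from": "unmasked"},
        },
    },
}
```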

This violates the RETURNN principle that the user should not need to think about rec automatic optimization.


What are possible solutions?

I don't really have a good idea so far. Whatever the solution is, it should be opaque to the user.

Maybe the new dim tag for T' could carry some special flag marking it as a masked subset of T, and whenever it is combined with the original dim tag, the unmasking would happen automatically.

albertz commented 2 years ago

@Zettelkasten @mmz33 @robin-p-schmitt maybe you have some ideas, or comments?

albertz commented 2 years ago

Note that I already recognized this as a problem some time ago, but it came up again now when thinking about how to specify the output spatial dim (out_spatial_dim, #597) and its exact behavior.

When the loop is not optimized (e.g. simply with optimize_move_layers_out=False) and you then accumulate the output of the MaskedComputationLayer, you get shape [T,B,D], and the shape [T',B,D] never actually appears.

Should out_spatial_dim in all cases refer to T', even when T' would actually never exist? So basically, out_spatial_dim would be ignored inside the loop, and only be used once the layer is outside the rec loop?
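A sketch of how that could look on the user side, following the out_spatial_dim proposal from #597. The option and the dedicated dim tag here are proposed/hypothetical, not existing behavior; the layer content is arbitrary.

```python
from returnn.tf.util.data import SpatialDim  # dim tag helper, assuming a recent RETURNN

# Hypothetical dedicated dim tag for T' = sum(mask), per the #597 proposal.
masked_time_dim = SpatialDim("masked-time")

masked_layer = {
    "class": "masked_computation",
    "from": "data:source",
    "mask": "data:mask",
    # Proposed: always refers to T', even inside the loop where T' never materializes;
    # it would only take effect once the layer is moved out of the loop.
    "out_spatial_dim": masked_time_dim,
    "unit": {"class": "linear", "activation": "tanh", "n_out": 64, "from": "data"},
}
```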

albertz commented 2 years ago

The new dim T' would somehow need a reference to T. derived_from_tag is the obvious part, but we need more; specifically, we need the mask, or the indices for unmasking.

It's a bit like when e.g. ConvLayer does downsampling via striding: with stride 2, T' = T / 2. This could be formulated as a mask as well.
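A toy NumPy illustration of "striding is just a particular mask" (1-D, ignoring batch and feature dims):

```python
import numpy as np

T = 8
x = np.arange(T)                      # stand-in for frames along T

# Stride-2 downsampling, written as a boolean mask over T:
mask = (np.arange(T) % 2 == 0)        # [True, False, True, False, ...]
x_strided = x[::2]                    # the usual striding, shape [T//2]
x_masked = x[mask]                    # same frames selected via the mask, shape [T'] with T' = mask.sum()
assert (x_strided == x_masked).all()
```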

Now we can see the unmasking as a sparse-to-dense operation.

However, this is still ambiguous. In a sparse-to-dense operation, we would set 0 for all other frames, but that is rarely what we want here. Specifically, to reproduce the output of a MaskedComputationLayer, we need a very specific behavior, namely that the previous masked frame is copied.

So, when we want automatic unmasking from T' to T whenever such tensors are combined somewhere, e.g. via Data.get_common_data and then Data.copy_compatible_to, we don't just need the mask but also the kind of unmasking, i.e. how to fill the other frames.
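To make the difference concrete, a small NumPy sketch of the "copy the previous masked frame" unmasking versus a zero-fill sparse-to-dense (batch dim omitted; the handling of frames before the first masked frame is simplified):

```python
import numpy as np

# Toy "left" unmasking: fill unmasked frames with the most recent masked frame.
mask = np.array([True, False, False, True, False, True])   # [T], T' = 3
y_masked = np.array([10., 20., 30.])                       # [T'] outputs of the masked computation

# Index of the latest masked frame at or before each position
# (corresponds to the prev-output copying inside the loop).
idx = np.cumsum(mask) - 1                                   # [T], here [0, 0, 0, 1, 1, 2]
idx = np.maximum(idx, 0)                                    # simplification: in the loop, frames before the
                                                            # first masked frame would be the initial output
y_unmasked = y_masked[idx]                                  # [T] = [10, 10, 10, 20, 20, 30]

# A plain sparse-to-dense with zero fill would instead give [10, 0, 0, 20, 0, 30],
# which is not what the in-loop behavior of MaskedComputationLayer corresponds to.
```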

albertz commented 2 years ago

We could use the derived_from_op mechanism, via a special op kind "mask", and store the mask in the op attributes, along with something like unmask_type="left", to reflect that we want this specific behavior when unmasking.

This could later be extended to be able to represent sparse data via unmask_type="fill" and unmask_fill_value.
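A rough sketch of what such a dim tag could carry, using purely illustrative stand-in classes rather than the real Dim/derived_from_op structures; the op kind "mask" and the attribute names are the proposal above, nothing existing:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class DimOp:
    kind: str                                   # e.g. the proposed new kind "mask"
    inputs: List["DimTag"]                      # would reference the original dim T
    attribs: Dict[str, Any] = field(default_factory=dict)

@dataclass
class DimTag:
    description: str
    derived_from_op: Optional[DimOp] = None

# T' derived from T via a mask; the mask itself (a Data of shape [B,T], bool)
# and the unmask behavior are stored in the op attribs.
time_dim = DimTag("time")
masked_time_dim = DimTag(
    "masked-time",
    derived_from_op=DimOp(
        kind="mask",
        inputs=[time_dim],
        attribs={
            "mask": "mask_data",       # placeholder for the Data instance holding the mask
            "unmask_type": "left",     # copy the previous masked frame
            # later maybe: "unmask_type": "fill", "unmask_fill_value": 0
        },
    ),
)
```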

The mask itself would be a Data instance. In our case, it would have shape [B,T] with dtype bool. The shape could probably be arbitrary, although it should contain the original dim T.

Or, as mentioned, instead of a mask, we could also store the indices, i.e. a Data of shape [B,T'] with indices pointing into T.
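The two representations are interchangeable in principle. A small NumPy sketch of the conversion (assuming, for simplicity, the same number of masked frames per batch entry; in general T' is per-batch, i.e. ragged):

```python
import numpy as np

# Mask [B,T] bool -> indices [B,T'] into T:
mask = np.array([[True, False, False, True, False, True]])       # [B,T]
indices = np.stack([np.flatnonzero(m) for m in mask])             # [B,T'], here [[0, 3, 5]]

# ...and back:
mask_back = np.zeros_like(mask)
np.put_along_axis(mask_back, indices, True, axis=1)
assert (mask_back == mask).all()
```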