Design/Handling of dimension tags

albertz commented 3 years ago

Like batch dim, spatial dims (with dynamic lengths), or static dims (named, or also unnamed).

Dim (earlier DimensionTag) in RETURNN. Directly use that, or wrap it somehow?

Should this (the batch dim) include beam information, or be separate from this?

Relevant for all layers which define some shape or dimension (e.g. Const, Variable).

Should this be enforced, i.e. no simple int allowed in n_out or so but always a Dim object? And maybe better use out_dim instead of n_out (consistent with https://github.com/rwth-i6/returnn/issues/597). Edit: It was decided to make nn.Dim mandatory, and use out_dim instead of n_out.

Very related is this issue on RETURNN side on explicit dim tags: https://github.com/rwth-i6/returnn/issues/597

Related is also whether we want unique dim tags? (#48, https://github.com/rwth-i6/returnn/issues/632)

This issues covers multiple aspects:

[x] Use dim tags. Directly use RETURNN Dim. We inherit all its logic on equality etc. We also have FeatureDim, SpatialDim from RETURNN.
[x] Dim tags (Dim instances) are mandatory for any shape or size
[x] Shape for all tensors is always available (via #47)
[x] We have Tensor.verify_out_shape for easy verification
[x] Solution for in_dim == out_dim, square matrices, https://github.com/rwth-i6/returnn_common/issues/17#issuecomment-997312167, https://github.com/rwth-i6/returnn/pull/871
[x] Instead of out_spatial_dim argument, the a new spatial dim gets returned. See pool1d for example.
[x] Tensor shape annotations, moved to #97

Atticus1806 commented 3 years ago

Would this replace the possibility to define out_type (which you would do in RETURNN)? I didn't continue with the debugging yet, but I think I got some kind of related issue in my config right now. Cannot tell for sure, will report back once I had time to check.

albertz commented 3 years ago

I'm not sure yet whether this would at the same time remove the old way, or just be an additional way.

In any case, I don't think this would really change much in behavior, depending on what you did. So whatever problem you had, you might have it again now. But not sure about the details on your issue. You maybe should report a separate issue about it. But also not sure if this is even related here to returnn-common, or just RETURNN itself?

But also, in general, you never should use out_type anyway. If you do, this sounds like you probably should do it in a different way.

The dimension tags would be used whenever you must specify an axis (for Reduce, Dot, etc), or a new shape (for Const, Variable, etc).

In general, I want to push the concept of named tensors (named array axes; i.e. dimension tags in RETURNN) much more. I want to deprecate or even not allow axis specified as integers. Axis specified as integers (index) can very easily become a problem because the index can easily change (because we optimize sth differently, then the order of axis will be different, or the user adds some other layers and doesn't notice, etc). So the order of axes should not matter at no point.

Atticus1806 commented 3 years ago

But also, in general, you never should use out_type anyway. If you do, this sounds like you probably should do it in a different way.

Without further investigation on my own config, how would this work with an Eval layer here? Is returnn able to (or should be able to) implicitly solve the eval and handle the dimension?

In general, I want to push the concept of named tensors (named array axes; i.e. dimension tags in RETURNN) much more.

I like this idea, this would make things a lot more straight forward too, I think.

albertz commented 3 years ago

But also, in general, you never should use out_type anyway. If you do, this sounds like you probably should do it in a different way.

Without further investigation on my own config, how would this work with an Eval layer here? Is returnn able to (or should be able to) implicitly solve the eval and handle the dimension?

The EvalLayer should never change the shape. EvalLayer was never intended for this (although I know that some people abuse it for that). All reshaping logic should be handled by other layers. If you think sth is missing in RETURNN, open a separate issue in RETURNN about this.

But for returnn-common, we don't even need Eval at all because any possible use case is much more elegant using directly the returnn-common features. The possible use cases was actually just convenience, to write sth like a * b + c * d in a simpler way. Although that was already possible before in RETURNN using CombineLayer with kind mul and add. But just a bit annoying to write a lot of separate layers for that. But now in returnn-common, you can simply directly write sth like a * b + c * d. So no need for Eval.

If sth is not clear about this, please open a separate issue, as this is all off-topic now from this issue here.

albertz commented 2 years ago

We should clarify this because this is very fundamental to how all our code will look like.

The main question is to decide on these two options:

~~No special handling here in returnn-common for dim tags. Basically just as optional as in RETURNN.~~
Make it mandatory to use. This is whenever axis, axes, new_dim is involved, but also for out_dim in Linear, LSTM, Conv and some others. Edit we decided to do this.

I think when we require them, we should also require them to be unique.

I tend to prefer having them mandatory. This might make some parts maybe a bit more verbose from the first glance. But this should resolve any possible ambiguity on axes, and make the code very explicit and clear.

We might however introduce some shorter aliases, or maybe restructure things a bit. For example:

Dim = DimensionTag Edit: Done. Actually the other way around, it was renamed to Dim, and we have an alias for DimensionTag for older code (mostly RETURNN, not relevant for returnn_common).
Have one global B or Batch or batch_dim for Dim(kind=DimensionTag.Types.Batch). Edit We have a global batch_dim now.
F(dim, name=None, **kwargs) for Dim(kind=DimensionTag.Types.Feature, dimension=dim, description=name, **kwargs). Edit We have FeatureDim(...) now.
S(name=None, **kwargs) for Dim(kind=DimensionTag.Types.Spatial, description=name, **kwargs). Edit We have SpatialDim(...) now.

So in your config, it could look like:

input_dim = nn.FeatureDim("mfcc", 40)
classes_dim = nn.FeatureDim("cart", 5000)
time_dim = nn.SpatialDim("time")

extern_data = {
  "data": {"dim_tags": [nn.batch_dim, time_dim, input_dim]},
  "classes": {"dim_tags": [nn.batch_dim, time_dim], "sparse_dim": classes_dim}
}

class Model(nn.Module):
  def __init__(self, in_dim: nn.Dim, out_dim: nn.Dim):
    super().__init__()
    self.linear = nn.Linear(out_dim=out_dim, in_dim=in_dim)

  def forward(self, input: nn.LayerRef) -> Layer:
    return self.linear(input)

model = Model(in_dim=input_dim, out_dim=classes_dim)

...

(Just a draft.)

albertz commented 2 years ago

Not much feedback has been given here. So you agree?

albertz commented 2 years ago

We should maybe clarify some of the implications. For lots of code, not much changes, except the signature changes type int to DimensionTag.

When new dimensions are introduced, this looks a bit more verbose.

albertz commented 2 years ago

Would this imply that "F" or "T" are also not allowed anymore?

JackTemaki commented 2 years ago

Mandatory is fine for me.

Would this imply that "F" or "T" are also not allowed anymore?

I do not have enough knowledge to understand the consequences of this.

albertz commented 2 years ago

Would this imply that "F" or "T" are also not allowed anymore?

I do not have enough knowledge to understand the consequences of this.

Sorry, maybe I was confusing here. This is a design decision we should take. We can either still allow "F" and "T" or not. Very related is also a similar question on RETURNN side: https://github.com/rwth-i6/returnn/issues/586

Or maybe we just should change the behavior (on RETURNN side or here in returnn-common): Whenever "F" or "T" is non-ambiguous (according to some sensible rules), then it is allowed, otherwise not. The rules could be:

F == static dimension (only valid if there is exactly one static dimension)
T == dynamic dimension (only valid if there is exactly one dynamic dimension; excluding batch dim of course)
B == batch dim (only valid if there is only exactly one batch dim)

albertz commented 2 years ago

Some aspects on DimensionTag should maybe also be simplified directly on RETURNN side. Here is the corresponding issue: https://github.com/rwth-i6/returnn/issues/782

albertz commented 2 years ago

We have some simplifications on RETURNN side now, via https://github.com/rwth-i6/returnn/pull/822.

Specifically, the class was renamed to simply Dim. And we have helpers SpatialDim(...) and FeatureDim(...) for the user.

albertz commented 2 years ago

We decided to make it mandatory. This is implemented now but still needs some follow-up fixes.

Also, we again need to think about changing or extending the way to get extern data into the network. Specifically, where and how the dim tags of extern data are defined. So related is #44 and maybe also #56.

JackTemaki commented 2 years ago

I think the external data is a rather small problem here, because there it is clear they come from "outside" the network. I think the bigger difficulty is on the decision where to actually define the dimension tags that are needed within modules, and how to collect them to write them in a config.

For example the current code of the SelfAttention is problematic in that sense:

class SelfAttention(SelfAttentionBase):
  """
  Classic self attention
  """
  @nn.scoped
  def __call__(self, source: nn.LayerRef, *, axis: nn.Dim) -> nn.Layer:
    """forward"""
    expand_dim = nn.SpatialDim("self_att_expand_dim")
    qkv = self.qkv(source)
    qkv = nn.split_dims(
      qkv, axis=self.qkv_dim_total, dims=(self.num_heads, self.qkv_dim_per_head), name="qkv_split_dims")
    q, k, v = nn.split(
      qkv, axis=self.qkv_dim_per_head,
      out_dims=(self.key_dim_per_head, self.key_dim_per_head, self.value_dim_per_head),
      name="qkv_split")
    k = nn.reinterpret_data(k, set_dim_tags={axis: expand_dim}, name="k_new_dim")
    v = nn.reinterpret_data(v, set_dim_tags={axis: expand_dim}, name="v_new_dim")
    att = dot_attention(q, k, v, key_dim=self.key_dim_per_head, axis=expand_dim, att_dropout=self.att_dropout)
    output = nn.merge_dims(
      att, axes=(self.num_heads, self.value_dim_per_head), out_dim=self.value_dim_total, name="output")
    return output

Here expand_dim is a local variable that is not accessable from the outside. So the question is: is it good that it always has the same name for all instances of SelfAttention you would create/apply? Especially when you would apply the same SelfAttention on two different inputs (axis would be a different Dim) I am not sure if this will work at all then.

Then, the returnn_common functionality needs to be fully serializable. A random idea of me would be that you need to collect all locally created dims and add them either to the return value or a local variable (I guess the later makes more sense), and the nn.scoped wrapper can then collect this properly in the sense that it adds maybe a unique name prefix for the dim in relation to the subnetwork name the __call__ result will be assigned to.

And then a context needs to be able to return all dim instances that were created anywhere, so that this is properly serializable.

The serialization itself is of course another issue, especially when dim tags are passed among many modules, but I guess this is not a problem related to returnn_common itself but more about the way in which we would use it in combination with Sisyphus. But as long as each dimension tag object instance has a unique name (~~if not you could use random strings, but this would look ugly~~ has to be unique and static with respect to the module for proper hashing) and a correct __repr__ implementation this is easy to solve on the Sisyphus side.

albertz commented 2 years ago

I think the external data is a rather small problem here, because there it is clear they come from "outside" the network. I think the bigger difficulty is on the decision where to actually define the dimension tags that are needed within modules, and how to collect them to write them in a config.

I don't really see a problem there.

All dims which come from outside and which are relevant for a module, for its inputs, would be passed as arguments (axis, in_dim, in_spatial_dim, whatever).
All dims which are configurable and supposed to be used by the module for some output dimension would be passed as arguments as well (out_dim, out_spatial_dim, whatever).
All dims which are only used within the module for some intermediate calculations can stay completely within the module. They would never be visible to the outside.

For example the current code of the SelfAttention is problematic in that sense:
class SelfAttention(SelfAttentionBase):
  """
  Classic self attention
  """
  @nn.scoped
  def __call__(self, source: nn.LayerRef, *, axis: nn.Dim) -> nn.Layer:
    """forward"""
    expand_dim = nn.SpatialDim("self_att_expand_dim")
    ...
Here expand_dim is a local variable that is not accessable from the outside. So the question is: is it good that it always has the same name for all instances of SelfAttention you would create/apply? Especially when you would apply the same SelfAttention on two different inputs (axis would be a different Dim) I am not sure if this will work at all then.

expand_dim here is only used inside the module. It is also never visible to the outside. Why should it matter? Why would it not work? Where exactly is the problem? Why is the name a problem?

albertz commented 2 years ago

Regarding the serialization, I think we discussed this already. This is easily doable. I think I would just handle this in Net.make_net_dict_raw or so at the end. It would simply collect all Dim instances, put them into some dict (dims or so) which would have unique names (or any keys) for all the dim tags, and then we would replace the Dim instance by some DimProxy or so where the __repr__ would look like dims[key]. And then next to (before) the network, you would also put the dims in the RETURNN config.

We cannot make use of Dim.__repr__ because we always have Dim(...) != Dim(...) (different instances are different). We had some related discussion in https://github.com/rwth-i6/returnn/issues/860.

Beside network and dims, I think we might also need to put other options. E.g. some aspects here assume a recent RETURNN behavior_version, at least 11. I think this should also be put there. I'm currently thinking about also handling extern_data completely via returnn-common, such that you have the extern data dims directly available, so then we would also create that.

JackTemaki commented 2 years ago

The second reply answers all the questions regarding serialization and unique names.

I see that I did not specify my first question good. I discussed this with Benedikt and I am not sure if this is even a problem. I thought about debugging, where you can have an error with a dimension, but if it is created with the same name at e.g. 50 locations you might not be able to pinpoint which dim exactly caused the error. But I guess in practice the error will always occur in conjunction with the layer where it is used, so I guess this is no problem at all. Sorry for the misleading question.

albertz commented 2 years ago

I think just experience when using this will tell us whether and where we can maybe improve sth, and what we could improve. Certain things are very simple to improve. E.g. we could somehow attach the context where a Dim was created to the Dim instance, if that really turns out to be a problem.

Zettelkasten commented 2 years ago

All dims which are configurable and supposed to be used by the module for some output dimension would be passed as arguments as well (out_dim, out_spatial_dim, whatever).

Also, we would need to think about how out_shape should work. Usually, you have some set of common dims (like the batch dim, but maybe more), that the module will not operate on. We would also need to pass a set of these to each layer in some way. Or maybe we could also supply some explicit out_shape to the module, and then infer it from there.

albertz commented 2 years ago

All dims which are configurable and supposed to be used by the module for some output dimension would be passed as arguments as well (out_dim, out_spatial_dim, whatever).

Also, we would need to think about how out_shape should work. Usually, you have some set of common dims (like the batch dim, but maybe more), that the module will not operate on. We would also need to pass a set of these to each layer in some way. Or maybe we could also supply some explicit out_shape to the module, and then infer it from there.

Ah yes, good point. Maybe for the __call__, there could be another in_shape argument and then the module can derive it and add such checks on intermediate outputs. Or maybe better, via #47, we anyway have the shape of any inputs already, so no need for an extra in_shape. The user could have checked in advance that the inputs are as expected.

There could be an out_shape argument for module __call__. Although some modules might return multiple things, then this is a bit ambiguous. Also we cannot really make sure that all modules have this argument, so this would more be a convention than a strict rule. But this is ok.

Or, when we do #47, the user could explicitly do it afterwards like:

out = model(..., out_dim=feature_dim)
out.assert_shape({batch_dim, time_dim, feature_dim})

Maybe that is cleaner?

Zettelkasten commented 2 years ago

Or maybe better, via #47, we anyway have the shape of any inputs already, so no need for an extra in_shape. The user could have checked in advance that the inputs are as expected

That would be even better, yes. Then every module checking shapes (which ideally, every module should do I think), would have something like this on top:

common_dims = in.get_shape() - {in_dim}
out = ...
out.assert_shape(common_dims | {out_dim})

Or, when we do #47, the user could explicitly do it afterwards like:
out = model(..., out_dim=feature_dim)
out.assert_shape({batch_dim, time_dim, feature_dim})
Maybe that is cleaner?

This wouldn't work for intermediate states right? (Also, when nesting modules, the outer module would also need to know the common dims to assert the output of the inner module). I think more complicated modules should check their output type themselves, too. Just like you would specify the return type of a function, this is very important documentation.

albertz commented 2 years ago

The out_shape or assert_shape is anyway mostly for the user. A user might not care what happens inside the model, only that he gets some specific output in the end.

However, if you are the author of model, or you want that any user can easily read the code of your model, then sure, you would also have out_shape or assert_shape usages inside the model, for any intermediate outputs, and maybe also the final output.

We should be careful to not make it too verbose or too complicated. When the code looks complex from a first glance, this is also bad. out_shape might be a bit better w.r.t. this aspect because it avoids a separate call assert_shape. But we anyway could support both.

Another aspect: Probably many modules will be lazy w.r.t. the input dim, just like it is standard in RETURNN and other frameworks. E.g. the user would specify out_dim but not in_dim. So in_dim is only known on the first call to the module, and then automatically inferred from the input (in.in_dim or so, which refers to the feature dim). I think enforcing to specify in_dim at module creation would be too verbose in many cases, so that is why we would have to support such lazy in_dim, and also using the feature dim logic.

albertz commented 2 years ago

So, I currently have this Linear module implementation:

class Linear(nn.Module):
  """
  Linear transformation.
  """

  def __init__(self, out_dim: nn.Dim, *, in_dim: Optional[nn.Dim] = None):
    super().__init__()
    self.out_dim = out_dim
    self.out_dim_inner = out_dim
    self.in_dim = in_dim
    self.weight = None  # type: Optional[nn.Parameter]
    self.bias = None  # type: Optional[nn.Parameter]
    if in_dim:
      self._lazy_init(in_dim)

  def _lazy_init(self, in_dim: nn.Dim):
    if self.in_dim:
      assert self.in_dim == in_dim
    else:
      self.in_dim = in_dim
      if in_dim == self.out_dim:
        self.out_dim_inner = self.out_dim.copy(same_as_self=False, description=f"{self}:out-dim-inner")
      self.weight = nn.Parameter((self.in_dim, self.out_dim_inner))
      self.bias = nn.Parameter((self.out_dim_inner,))

  @nn.scoped
  def __call__(self, source: nn.LayerRef) -> nn.Layer:
    self._lazy_init(source.dim)
    out = nn.dot(source, self.weight, reduce=self.in_dim) + self.bias
    if self.out_dim_inner != self.out_dim:
      out = nn.reinterpret_data(out, set_dim_tags={self.out_dim_inner: self.out_dim})
    return out

You see that I needed to introduce this out_dim_inner here. The problem occurs when we have in_dim == out_dim. The dim tags in the parameter are not distinguishable then, and the dot reduce=in_dim is ambiguous.

I'm not really sure about the best way to solve this. I'm not really happy with this solution now.

@Zettelkasten any suggestions or ideas?

albertz commented 2 years ago

While further thinking about this, some thoughts and options:

I think no matter what we do, we should not make an exception that the order of axes matters here now in this case. The order should never matter.
So, in dim_tags = (in_dim, out_dim) with out_dim = in_dim, we need some way to distinguish those dims. But this cannot be just the order.
When nothing is done, I think the current behavior of RETURNN is actually weird or wrong anyway. It looks like it reduces both axes and then fails with some strange other error? There should be some further testing and fixing on RETURNN side to better catch this with a better error message. Already get_axis_from_description should fail due to the ambiguity.
One option to distinguish them is in my example code above. I.e. in the case in_dim == out_dim, it does out_dim_ = out_dim.copy(same_as_self=False, ...) and then using shape (in_dim, out_dim_).
- I don't really like this too much because other code might expect the shape (in_dim, out_dim). Or all other code must also have this special logic for this special case.
- We probably have this issue also in other instances, e.g. for LSTM or others, when there is some matrix multiplication involved. It should be simple to always make use of the same logic for this case.
- It's too complicated, too ugly.
When reduce=in_dim is searching for matching dims, we need some way that it uniquely finds the right axis.
Maybe we can have some matching_priority flag or so, set on dim tags. In get_axis_from_description, when a dim tag matches multiple existing dim tags, but one of the matching tags has a higher matching priority, then it would still be allowed, and it would use that one. So in the example, in_dim == out_dim (but in_dim is not out_dim) but in_dim would be a copy with higher matching priority. The code would maybe look like:
```
weight = nn.Parameter((in_dim.copy(matching_priority=1), out_dim))
```
Edit I implemented this now here: https://github.com/rwth-i6/returnn/pull/871

albertz commented 2 years ago

Another thing: I wonder whether passing out_spatial_dim as argument is maybe un-intuitive. In many cases, it is a new dimension. So (also as the name implies) it is an output of the function or module. The current code often looks like this:

out_spatial_dim = nn.SpatialDim("whatever")
out = self.conv(..., out_spatial_dim=out_spatial_dim)

This is for dynamic sizes. For static dims, it is even more complicated, as you need to calculate the dim by hand (although maybe you could argue, it is a good thing to have it explicit).

Maybe the module should instead just create such a dim itself, and then return it?

One question is, how should it be returned? Just Tuple[nn.Layer, nn.Dim]? Or namedtuple? Or sth else?

For example, it would then look like:

out, out_spatial_dim = self.conv(...)

I think this is nicer.

The case where it is known beforehand (or expected to match some other existing dim) is maybe valid, but the user could easily still do that by calling declare_same_as afterwards.

out_dim is different though. out_dim would stay an argument. This is because out_dim is usually a feature dim and usually can be arbitrarily defined by the user, like Linear or Conv. Although, maybe for the cases where it is not arbitrary, like merge_dims for example, maybe there it should be returned as well?

Or consider the ConformerConvSubsample.__call__:

@nn.scoped
def __call__(self, inp: nn.LayerRef, *, in_spatial_dim: nn.Dim, out_spatial_dim: nn.Dim) -> nn.LayerRef:
  """forward"""
  in_spatial_dims = [in_spatial_dim, inp.feature_dim]
  in_dim = nn.FeatureDim("dummy-input-feature-dim", 1)
  x = nn.expand_dim(inp, dim=in_dim)
  for i, conv_layer in enumerate(self.conv_layers):
    out_spatial_dims = [nn.SpatialDim(f"conv-{i}-1"), nn.SpatialDim(f"conv-{i}-2")]
    x = conv_layer(x, in_dim=in_dim, in_spatial_dims=in_spatial_dims, out_spatial_dims=out_spatial_dims)
    in_spatial_dims = out_spatial_dims
    in_dim = conv_layer.out_dim
    x = self.activation(x)
    if self.pool_sizes and i < len(self.pool_sizes):
      x = nn.pool(
        x, in_dim=in_dim, in_spatial_dims=in_spatial_dims,
        pool_size=self.pool_sizes[i], padding='same', mode='max')
    if self.dropout:
      x = nn.dropout(x, axis=in_dim, dropout=self.dropout)
  out = nn.merge_dims(x, axes=in_spatial_dims, out_dim=out_spatial_dim)
  return out

With the suggestion, it would look like:

@nn.scoped
def __call__(self, inp: nn.LayerRef, *, in_spatial_dim: nn.Dim) -> nn.LayerRef:
  """forward"""
  in_spatial_dims = [in_spatial_dim, inp.feature_dim]
  in_dim = nn.FeatureDim("dummy-input-feature-dim", 1)
  x = nn.expand_dim(inp, dim=in_dim)
  for i, conv_layer in enumerate(self.conv_layers):
    x, in_spatial_dims = conv_layer(x, in_dim=in_dim, in_spatial_dims=in_spatial_dims)
    in_dim = conv_layer.out_dim
    x = self.activation(x)
    if self.pool_sizes and i < len(self.pool_sizes):
      x = nn.pool(
        x, in_dim=in_dim, in_spatial_dims=in_spatial_dims,
        pool_size=self.pool_sizes[i], padding='same', mode='max')
    if self.dropout:
      x = nn.dropout(x, axis=in_dim, dropout=self.dropout)
  out, out_spatial_dim = nn.merge_dims(x, axes=in_spatial_dims)
  return out, out_spatial_dim

@Zettelkasten opinions?

Edit This is implemented like this now.

albertz commented 2 years ago

The last suggestion was implemented now, i.e. namely that functions and modules return a potentially new dim tag, and the argument (e.g. out_spatial_dim) is optional, and when None, it auto-creates some new dim.

albertz commented 2 years ago

Remaining is the issue with out_dim_inner. There is still the suggested solution here: https://github.com/rwth-i6/returnn/pull/871 This also has some further discussion. In any case, out_dim_inner should be removed. The param should be enough.

albertz commented 2 years ago

Another external data point is Mesh TensorFlow, and the gpt-neo implementation. Examples:

dim_kv = mtf.Dimension("features_per_head", params["n_embd"] // params["n_head"])
mtfparams = mtf.transformer.attention.attention_params_simple(
  x.mesh,
  io_dim=dim_embd,
  kv_dim=dim_kv,
  heads_dim=dim_heads,
  variable_dtype=variable_dtype
)

(via)

k = mtf.replace_dimensions(k, k.shape[1], memory_length_dim)
v = mtf.replace_dimensions(v, v.shape[1], memory_length_dim)

a = mtf_transformer.attention.attention(
  q, k, v,
  memory_length_dim=memory_length_dim,
  key_dim=dim_kv,
  value_dim=dim_kv,
  bias=broadcasted_bias,
  dropout_rate=attn_dropout_rate
)

def causal_linear_attention(q, k, v, eps = 1e-6):
  batch_dim, seq_dim, head_dim, dim_out = (v.shape[0], v.shape[1], v.shape[2], v.shape[3])
  q = mtf.rename_dimension(q, "features_per_head", "features_per_head_in")
  k = mtf.rename_dimension(k, "features_per_head", "features_per_head_in")

  dim_in = k.shape[-1]

  q = mtf.softmax(q, dim_in)
  k = mtf.exp(k)

  cumulative_k = mtf.cumsum(k, seq_dim) + eps
  D_inv = 1. / mtf.einsum([q, cumulative_k], output_shape=[batch_dim, seq_dim, head_dim])

  context = mtf.einsum([k, v], output_shape=[batch_dim, seq_dim, head_dim, dim_in, dim_out])
  cumulative_context = mtf.cumsum(context, seq_dim)

  attn = mtf.einsum([q, cumulative_context, D_inv], output_shape=[batch_dim, seq_dim, head_dim, dim_out])
  return attn

(via)

def einsum(xs, output_shape=None, reduced_dims=None, name=None):
...

(here)

Note that a mtf.Dimension is very simple. It's just (def):

Dimension = collections.namedtuple("Dimension", ["name", "size"])

where name: str and size: int.

MTF also requires unique dimensions in a shape (see here).

albertz commented 2 years ago

(Edit I moved this to another separate issue: #97)

Some further thoughts:

out_shape or assert_shape is about checking the complete shape including all dims. The user or developer might use this as a kind of verification.

The API of any module or function might (should) however be flexible to any number of dimensions. Basically, for some input tensor, there are maybe one or more dims which are relevant (in_dim, in_spatial_dims, axis, or whatever) and there could also be any other number of dims which will then be treated just as the batch dim.

This is maybe a separate aspect. The definition of module/function API is maybe not covered well yet. The usual syntax is via type annotations. There this new Python PEP is relevant, although we would want to differ in a few aspects:

Type annotations are usually only checked statically by external software (MyPy or PyCharm). We might want to check it always, also at runtime.
We would not define a shape tuple but a set of dims for input and output.
This set of dims would not be complete. There could always be other dims like batch dim and others.
In the output shape, the same.
This PEP is only supported since Python 3.10 but we want to support Python >=3.6 or so.

One question is whether we should somehow be more explicit when there are multiple inputs, whether the output would include all dims from all inputs, or sth else. E.g. maybe it could look like:

func(a: Tensor[{A, B, *more_a}], b: Tensor[{A, B, *more_b}])
  -> Tensor[{A, B, *more_a.union(more_b)}]

Or simpler:

func(a: Tensor[{A, B, Any}], b: Tensor[{A, B, Any}])
  -> Tensor[{A, B, Any}]

With the assumption that the output would always be the union of the inputs.

Or more specific:

func(a: Tensor[{A, B, Any}], b: Tensor[{A, B, Any}])
  -> Tensor[{A, B, Any(a, b)}]

Or even more simpler:

func(a: Tensor[{A, B}], b: Tensor[{A, B}])
  -> Tensor[{A, B}]

With the assumption that there always can be additional dims.

albertz commented 2 years ago

Dim.match_priority support (https://github.com/rwth-i6/returnn/pull/871) is merged now, due to the lack of any better solution for the ambiguous dim tags.

Now we need to adopt the code here accordingly (get rid of out_dim_inner etc).

albertz commented 2 years ago

This is all implemented now.

rwth-i6 / returnn_common

Design/Handling of dimension tags #17