rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

SplitLayer is never optimized as output layer #558

Closed. Zettelkasten closed this issue 3 years ago.

Zettelkasten commented 3 years ago

Consider some net dict of a rec unit like this (see #559 for a full test case):

  "input": {..., "from": "prev:input"},  # cannot be optimized out
  "split": {"class": "split", "from": "input", "size_splits": [3, 7]},
  "output": {"class": "linear", "from": ["split/0", "split/1"]},  # can be optimized out
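
For context, a minimal sketch of how such a unit dict could be embedded; the "rec" wrapper, the dims, and the "data:source" input are my assumptions here, not taken from the test case in #559:

  network = {
      "rec": {"class": "rec", "from": "data", "unit": {
          "input": {"class": "linear", "n_out": 10, "from": ["data:source", "prev:input"]},  # cannot be optimized out
          "split": {"class": "split", "from": "input", "size_splits": [3, 7]},  # 3 + 7 = 10 = n_out of "input"
          "output": {"class": "linear", "n_out": 5, "from": ["split/0", "split/1"]},  # can be optimized out
      }},
  }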

Then, split should also be optimized out as an output layer. This is not the case, however, because split/0 and split/1 have collocate_with=["split"] set by _TemplateLayer.get_sub_layer, but also a dependency on their parent layer split.

E.g. split/0 is not optimized out because the logic of output_can_move_out(split/0) does the following (a runnable sketch of this check follows below):

  1. It checks whether the collocated parent split can be moved out, with _collocated={split/0}.
  2. However, split cannot be moved out, because split/1 depends on it, and split/1 has not been moved out yet.

Symmetrically, split/1 is also not optimized out.
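
To make the failing check concrete, here is a small runnable toy model of that logic (all structures and helper names here are simplified stand-ins, not the actual RETURNN implementation):

  # Toy model of the current (pre-#559) check.
  layers = {
      "split":   {"collocate_with": [], "deps": ["input"]},
      "split/0": {"collocate_with": ["split"], "deps": ["split"]},
      "split/1": {"collocate_with": ["split"], "deps": ["split"]},
  }
  moved_out = set()  # layers already moved out of the rec loop

  def users_of(name):
      return [other for other, d in layers.items() if name in d["deps"]]

  def can_move_out(name, ignore):
      # a layer can move out only if no remaining in-loop layer depends on it
      return all(u in ignore or u in moved_out for u in users_of(name))

  def output_can_move_out(name):
      # for a sub layer, check its collocated parent, ignoring only this sub layer
      return all(can_move_out(parent, ignore={name})
                 for parent in layers[name]["collocate_with"])

  print(output_can_move_out("split/0"))  # False: split/1 still depends on split
  print(output_can_move_out("split/1"))  # False: split/0 still depends on split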


(Actually, in my use case, I don't even have a layer input that recurrently depends on itself. Instead, I use a :i layer, which is currently never optimized out as an input layer:

      if layer.name in [":i", "end"]:  # currently not fully implemented
        return False

Somewhat related - what is missing from the implementation there?)

albertz commented 3 years ago

Then, split should also be optimized out as an output layer.

Yes.

This is not the case, however, because split/0 and split/1 have collocate_with=["split"] ...

Hm, I think our whole logic with sub layers is a bit buggy, or not even well defined.

For the case of subnetwork sub layers, they are all completely independent (and the main subnet layer is just the same as its output sub layer).

Otherwise (basically all other layers which have sub layers), they all belong together, like in this case for SplitLayer.
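
For illustration, two simplified net dict fragments (dims invented) that show the difference:

  # Case 1: subnetwork sub layers are full layers of their own; the main
  # "subnet" layer is just the same as its "output" sub layer.
  "subnet": {"class": "subnetwork", "from": "data", "subnetwork": {
      "hidden": {"class": "linear", "n_out": 10, "from": "data"},
      "output": {"class": "linear", "n_out": 5, "from": "hidden"},
  }},

  # Case 2: split/0 and split/1 are just views on what the parent computes,
  # so the parent and its sub layers belong together.
  "split": {"class": "split", "from": "data", "size_splits": [3, 7]},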

Probably this second case is buggy.

I remember that @patrick-wilken did some changes/fixes recently on this?

(Actually, in my use case, I don't even have a layer input that recurrently depends on itself. Instead, I use a :i layer, which is currently never optimized out as an input layer:

      if layer.name in [":i", "end"]:  # currently not fully implemented
        return False

Somewhat related - what is missing from the implementation there?)

Let's keep this as a separate issue, and not mix them up here. This is really unrelated / orthogonal.

Moving out the layer responsible for :i (RecStepInfoLayer or something like that) is just not implemented (as far as I remember).

But also, it should not matter much, as this optimization would really be minor (it would use tf.range or so outside, but I don't think that the tf.while_loop for :i has any measurable effect). And any layers that depend on :i could still be moved out, even if :i itself is not moved out, because :i itself depends on nothing.
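
A hedged plain-TF sketch of the two variants (not RETURNN code, just to illustrate why the gain is negligible):

  import tensorflow as tf

  n = 7

  # Inside the loop, ":i" is essentially just the tf.while_loop counter:
  def body(i, acc):
      return i + 1, acc + tf.cast(i, tf.float32)
  _, total = tf.while_loop(lambda i, acc: i < n, body,
                           [tf.constant(0), tf.constant(0.0)])

  # Moved out of the loop, it would just be a precomputed range, which is a
  # trivially cheap op, so optimizing ":i" out buys almost nothing by itself:
  all_i = tf.range(n)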

albertz commented 3 years ago

however, split cannot be moved out because there is a dependency on split/1

Is this dependency in the right direction? Does this dependency make sense? I would expect that split/1 depends on split. Maybe that is the real problem here, and with that fixed, it would already work?

Zettelkasten commented 3 years ago

however, split cannot be moved out because there is a dependency on split/1

Is this dependency in the right direction? Does this dependency make sense? I would expect that split/1 depends on split. Maybe that is the real problem here, and with that fixed, it would already work?

Ah, my bad, split/1 depends on split. But that means that if split/1 is not optimized out (yet), split should also not be optimized out (as an output layer).

Zettelkasten commented 3 years ago

Otherwise (basically all other layers which have sub layers), they all belong together, like in this case for SplitLayer.

Probably this second case is buggy.

As far as I understand, the layers are kept together by setting collocate_with of all sub layers to the parent. If there is only one sub layer, then this was already handled by the logic: when checking whether it should optimize the sub layer sub (here: split/0), it checks whether it can optimize out the parent layer if it ignores the dependency on sub. When there is another layer sub2 (here: split/1) which is also collocated with the parent, this fails, because the dependency on sub2 is not ignored by this check. So in #559, I updated the logic to ignore not only sub, but also all other layers that are collocated with the parent.
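
In terms of the toy model sketched further above, the #559 change is roughly this (again just a sketch, not the actual diff):

  def output_can_move_out_fixed(name):
      for parent in layers[name]["collocate_with"]:
          # ignore *all* layers collocated with the parent, not just `name`
          siblings = {n for n, d in layers.items()
                      if parent in d["collocate_with"]}
          if not can_move_out(parent, ignore=siblings | {name}):
              return False
      return True

  print(output_can_move_out_fixed("split/0"))  # True: split/1 is now ignored too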

[regarding :i...] Let's keep this as a separate issue, and not mix them up here. This is really unrelated / orthogonal. [..] And any layers that depend on :i could still be moved out, even if :i itself is not moved out, because :i itself depends on nothing.

Yes, totally: layers that depend on :i can of course be moved out, but then they can only be moved out as output layers. In my case, that does not work, because this logic is currently buggy for SplitLayer. So adding move-out support for the :i layer would act as a workaround for this issue. (But yes, let's keep this as an independent issue. And let's of course try to fix this one properly.)

albertz commented 3 years ago

Hm, so we have both the collocate_with logic and the dependencies logic for sub layers. Maybe we do not need the dependencies logic, and the collocate_with logic covers everything we need, and maybe then the current logic already works?

Zettelkasten commented 3 years ago

Hm, so we have both the collocate_with logic and the dependencies logic for sub layers. Maybe we do not need the dependencies logic, and the collocate_with logic covers everything we need, and maybe then the current logic already works?

Hm, yeah. I tried it out, but it does not really change anything; it still cannot optimize out SplitLayer in my test case. In general, we could change it such that "a is collocated with b" implies "a does not depend on b", but that's orthogonal to this issue (and I don't think it's very useful). It also does not make anything easier in the logic, as we can also check for this explicitly if we want to, e.g. check for "a depends on b and a is not collocated with b" instead of "a depends on b" (see the sketch below).
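
In terms of the same toy model as above, the explicit check would be something like:

  def is_blocking_dependency(a, b):
      # count the dependency a -> b as blocking only if a is not collocated with b
      return b in layers[a]["deps"] and b not in layers[a]["collocate_with"]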

albertz commented 3 years ago

In general, we could change it such that "a is collocated with b" implies "a does not depend on b", but that's orthogonal to this issue (and I don't think it's very useful).

No, I think this doesn't make sense. collocate_with should be orthogonal to the dependency logic. But my earlier comment on collocation was wrong: collocation is not directed, unlike a dependency. It goes both ways (even if the flag in the config / the layer attribute is directed). But also see my other comments in the PR.
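
For illustration, a sketch of such an undirected check on the toy model from above (the stored attribute itself stays directed):

  def collocated(a, b):
      # the config flag / layer attribute is directed, but the relation is
      # symmetric: both layers must stay on the same side of the loop boundary
      return (b in layers[a]["collocate_with"]
              or a in layers[b]["collocate_with"])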