rwth-i6 / returnn_common

Common building blocks for RETURNN configs, such as models, training concepts, etc

Make training loop and stages explicit? #96

Open albertz opened 2 years ago

albertz commented 2 years ago

All the iterative-looking logic currently being implemented, like

y = ...
loss = cross_entropy(...)
loss.mark_as_loss()

this defines what is being calculated per step, so it assumes one outer loop over the steps. Actually, you can also see this more as a model definition. What is actually done per step in training is the model update via the optimizer, which is separated here. Actually the optimizer is also not really part of returnn-common yet.

In any case, mixing this now with logic which is done at initialization (also depending on whether this is a new run starting in epoch 1, or a continued run loading a previous checkpoint), and with logic done per epoch, can cause confusion.

Maybe we should make this logic more explicit, i.e. explicitly define which part of the calculation is executed in which context. Maybe similar to our Loop (#16) logic, with a context manager.

At initialization, we would also handle parameter initialization (#59), or maybe also custom checkpoint loading (#93).

Some suggestion:

# Create model. This will already set default param init.
model = Model(...)

model.param.init = ...  # overwrite with some custom

with nn.epoch_loop():
  ...

  with nn.step_loop():
    x = nn.extern_data(...)
    y = model(x)
    loss = nn.cross_entropy(y, targets)
    loss.mark_as_loss()

Old suggestion:

model = Model(...)

with nn.init_ctx():
  ...

with nn.step_loop():
  ...

with nn.epoch_loop():
  ...

This, however, is not so optimal, because param init would probably already happen naturally inside the Model constructor. Basically everything before the training loop would be part of the init context, just like we also have it for Loop.

Also, it does not follow the natural nesting: here the epoch loop comes after the step loop, while the step loop should really be inside the epoch loop.


(This was suggested as part of #93 on the model checkpoint load and store logic.)

JackTemaki commented 2 years ago

Actually the optimizer is also not really part of returnn-common yet.

Why would you actually want that? I get the feeling that returnn-common should now do many more things than originally intended (being a nice interface for network construction & sharing modules). It feels like we would end up with two completely distinct interfaces, where one has to be kept compatible so that the other one can be translated into it (translating from returnn-common into a config file). I do not think this is a good idea at all.

albertz commented 2 years ago

Actually the optimizer is also not really part of returnn-common yet.

Why would you actually want that?

For the case that the user wants to define a custom optimizer, or any other custom update rule.

Sure, the user could implement that again in pure TF and then import it. But one main idea of returnn-common is that we do not want to mix in custom TF code (with custom layers, or custom EvalLayer, etc.), and that instead returnn-common (like RETURNN itself) already provides all the basic functions. One reason people used EvalLayer before was also that it was just too annoying or verbose to do it all via layers, but this is also the point of returnn-common: that this should be very simple now.

So, again, this is now about having a custom optimizer. This is an optional thing. If you just want to use one of the existing ones, it is all fine.
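
Just to illustrate roughly what such a custom update rule could look like (pure sketch; nn.gradient and this kind of assign-style parameter update do not exist in returnn-common, the names are only made up for the example):

learning_rate = 0.001  # fixed value just for the sketch
for param in model.parameters():
  grad = nn.gradient(loss, param)  # hypothetical: gradient of the loss w.r.t. this param
  param.assign(param - learning_rate * grad)  # plain SGD update, just as an example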

This is only one small part of the issue here. This issue is also about allowing for more custom training loops, or just making the training loop explicit. Again it is about explicit vs implicit.

I get the feeling that returnn-common should now do many more things than originally intended (being a nice interface for network construction & sharing modules). It feels like we would end up with two completely distinct interfaces, where one has to be kept compatible so that the other one can be translated into it (translating from returnn-common into a config file). I do not think this is a good idea at all.

I don't exactly understand. This issue here is about having the training loop or epoch loop explicit, and also maybe having the param update explicit. These are all things which are not really possible in RETURNN itself currently. So these are strict extensions.

JackTemaki commented 2 years ago

These are all things which are not really possible in RETURNN itself currently. So these are strict extensions.

Maybe this issue here is just formulated very unclearly. I do not see what the difference to #59 and #93 is. Having epoch-based training stages is already possible (it is very easy to map returnn_common outputs to get_network; our current pipeline already does that), and for step-based changes (besides LR stuff) huge changes in RETURNN would be needed, and I would not be in favor of focusing on this, as this is not really a priority. Maybe you can clarify what additional logic you want to have; I do not understand that yet.

albertz commented 2 years ago

These are all things which are not really possible in RETURNN itself currently. So these are strict extensions.

Maybe this issue here is just formulated very unclearly. I do not see what the difference to #59 and #93 is. Having epoch-based training stages is already possible (it is very easy to map returnn_common outputs to get_network; our current pipeline already does that),

But this issue here is about making the epoch loop explicit, and then at some later point allowing other schemes as well. Currently the high-level logic in RETURNN always looks like this:

for epoch in range(num_epochs):
  (optional: new network, new step function)
  for batch in dataset.get_batches():
    session.run(step)

This is fine for many of the simpler applications and training schemes, but not everything fits into this scheme, or it would be quite unnatural or inefficient this way.

For example (really random example):

for epoch in range(num_epochs):
  # get alignments for the whole dataset (while not updating the model)
  for batch in dataset.get_batches():
    alignments.update(session.run(get_alignments()))

  # now update the model using the alignments
  for batch in dataset.get_batches():
    session.run(step)

I'm also thinking about reinforcement learning (#19, #20) (see sequence training as a special case).

Or maybe GAN-style training.

Or some meta learning.

There are many other custom training schemes as well. And while we have not really used them much, they are definitely also used for ASR/NLP applications. The reason we did not use them much is more that it was not easy for us to do so.

Or maybe you don't really want the outer loop over epochs at all, and instead have some dataset (maybe also using tf.data.Dataset) which just produces an infinite stream of batches, which is somewhat more standard, e.g. by wrapping an external dataset library like TensorFlow Datasets or even some PyTorch data streams. Then you would want it to look like:

for step_idx, batch in enumerate(train_dataset.get_batches()):
  session.run(train_step)

  if step_idx % N == 0:  # always after N steps
    eval_model(dev_dataset)
    learning_rate_update()
    store_model()

Or anything else like this. To be able to be flexible on this, it needs to be explicit. If the epoch loop or step loop is not explicit, it becomes either impossible or very difficult to have custom logic there.

Our current approach is to try to fit everything into the scheme above, and the epoch loop and step loop is implicit (not visible in the user config).

Note that this is all technically quite simple, esp when it is explicit.
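
For illustration, with the explicit contexts from the suggestion above, the alignment example could roughly look like this (pure sketch, nothing of this exists yet; nn.no_param_update is made up here just to mark the pass which does not update the model):

model = Model(...)

with nn.epoch_loop():
  # first pass over the dataset: collect alignments, do not update the model
  with nn.step_loop(), nn.no_param_update():
    x = nn.extern_data(...)
    alignments.update(get_alignments(model, x))

  # second pass: normal training steps using the collected alignments
  with nn.step_loop():
    x = nn.extern_data(...)
    loss = nn.cross_entropy(model(x), alignments)
    loss.mark_as_loss()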

and for step-based changes (besides LR stuff) huge changes in RETURNN would be needed,

I'm not sure if you mean changes for a single step (within the step-loop), or changes to the step-loop logic itself.

In the first case, changes within the step-loop, I don't see the problem; nothing needs to be changed there.

In the case of changing the step-loop logic, yes, this needs an extension on the RETURNN side, but it is a very simple extension. Definitely no huge changes are needed.

and I would not be in favor of focusing on this, as this is not really a priority.

Yes, maybe not for the first release (#32), and also not for any intermediate usage (#98).

However, when we later want to extend this to make this part flexible and explicit, I wonder if this makes it unnatural then. Because at that point we do want the current code to still work and not break it anymore. So when no explicit loop is defined, it must somehow fall back to the standard epoch/step loop, and otherwise use the explicit loop. But maybe this is not a problem.
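
A minimal sketch of what this fallback could look like on the RETURNN side (hypothetical; the custom_training_loop config entry is only made up here to show the direction):

custom_loop = config.typed_value("custom_training_loop")  # hypothetical config entry
if custom_loop is not None:
  # user defined an explicit training loop, hand over control
  custom_loop(model, train_dataset)
else:
  # fallback: the standard implicit epoch/step loop as before
  for epoch in range(num_epochs):
    for batch in train_dataset.get_batches():
      session.run(train_step)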

JackTemaki commented 2 years ago
for epoch in range(num_epochs):
  # get alignments for the whole dataset (while not updating the model)
  for batch in dataset.get_batches():
    alignments.update(session.run(get_alignments()))

  # now update the model using the alignments
  for batch in dataset.get_batches():
    session.run(step)

But this is a really good example of why your approach is maybe a little bit of a dead end here, at least from the perspective of how we structure experiments at i6. Our current pipelines already support alternating between alignment extraction and training, but of course with separate RETURNN calls, and I think it should stay like this. We already have the problem that returnn_common does not always fit optimally into our daily work in terms of Sisyphus integration (#51), because it turned out more complex than originally expected.

If I understand correctly what you imagine, then it seems to me that directly switching to e.g. PyTorch would be more comfortable, because then you can design your whole training process as you like with a fully tested and widely used framework, and just control it via some fixed parameters (I actually have this scenario in my daily work with two different PyTorch-based repositories).

The big advantage of RETURNN was the easily machine-readable and machine-writable config format, which can be very flexibly controlled by the Sisyphus pipeline for whatever is needed (at least in theory), and now we are going to break this concept with returnn_common. The biggest problem we had was that everyone had their own network definition helpers, and pre-training had to be defined manually. The core task of returnn_common was (for me) to have unified network construction helpers that allow for easy dynamic network construction. And for the most part this is covered now, although it can be improved w.r.t. converting existing setups (but this is not relevant here).

But now you want returnn_common to be a fully flexible interface that can do any task that you imagine, and this is where it becomes problematic now. First, there are now two different repositories, returnn and returnn_common, which are extremely complex and need to be kept in sync with each other, and returnn_common is importing returnn, which means using it in Sisyphus already breaks some of the design principles that are usually taken into account (that implementation and pipeline design are separated logically).

So if we were to go with your approach, this would mean that the usage of returnn_common would need to be taken out of the pipelines, and we would just define fixed systems/configs with returnn_common that we pass to jobs in the pipeline (as I did e.g. with the ReturnnTrainingFromFile job to quickly bring your hand-written systems into Sisyphus). While this would work, and would have the advantage that we do not need to care here in returnn_common at all about what happens in our pipelines, this is quite contrary to what we (referring e.g. to @mmz33 and @Atticus1806) expected returnn_common to do.

albertz commented 2 years ago
for epoch in range(num_epochs):
  # get alignments for the whole dataset (while not updating the model)
  for batch in dataset.get_batches():
    alignments.update(session.run(get_alignments()))

  # now update the model using the alignments
  for batch in dataset.get_batches():
    session.run(step)

But this is a really good example of why your approach is maybe a little bit of a dead end here,

Then just imagine any other example. It was just an example. This is really not the point here. Imagine some reinforcement learning example, or whatever.

The point is, there are cases where the standard structure (epoch/step training loop) is not optimal, and it can be useful that the user has flexibility over this. I'm sure you agree on this, right?

at least from the perspective of how we structure experiments at i6. Our current pipelines already support alternating between alignment extraction and training, but of course with separate RETURNN calls,

But even for this example, I would say it would be very suboptimal to do it this way. So suboptimal that you really would not do it like this in practice. You do not want separate RETURNN calls here. Recompiling the computation graph, reloading the model into GPU memory, even waiting for a new slot in the cluster: you don't want to have this after every single (sub)epoch. This is not realistic, or it would be very slow and waste a lot of computation time.

But anyway, let's not focus too much on this example. It was just an example to show that there are cases where you might want to have some more custom training loop.

and I think it should stay like this. We already have the problem that returnn_common does not always fit optimally into our daily work in terms of Sisyphus integration (#51), because it turned out more complex than originally expected.

Discussions regarding Sisyphus hashes should be moved to #51. I don't think this is really a big issue.

For any other issues with Sisyphus integration, please open separate issues. I don't think there are any issues. I'm not aware of any. If you think there are, as said, please open separate issues. Whatever there is, I think we can easily solve it without much effort.

This is somewhat off-topic here in this issue.

If I understand correctly what you imagine, then it seems to me that directly switching to e.g. PyTorch would be more comfortable, because then you can design your whole training process as you like with a fully tested and widely used framework, and just control it via some fixed parameters (I actually have this scenario in my daily work with two different PyTorch-based repositories).

Why is this one aspect of having custom training loops suddenly a reason to move over to a separate framework? This is just a single aspect, which should be a really minor change from the user perspective, to make the training loop somewhat more custom. The usual case is probably that the user started with the standard training loop, has a working setup, and then wants to do some experiment which requires a custom training loop (e.g. some sort of sequence training).

The big advantage of RETURNN was the easily machine-readable and machine-writable config format, which can be very flexibly controlled by the Sisyphus pipeline for whatever is needed (at least in theory), and now we are going to break this concept with returnn_common.

I don't understand. How? This issue here is just about the possibility to define a custom training loop.

So, yes, there is the question of having that explicit or allowing for the implicit default of using our standard training loop. This was one of the open questions here. So I understand that you argue for the implicit default? Sure, we can do that. Then this is just optional.

The biggest problem we had was that everyone had their own network definition helpers, and pre-training had to be defined manually. The core task of returnn_common was (for me) to have unified network construction helpers that allow for easy dynamic network construction. And for the most part this is covered now, although it can be improved w.r.t. converting existing setups (but this is not relevant here).

Yes? How is this related to this issue? Where is the problem?

But now you want returnn_common to be a fully flexible interface that can do any task that you imagine, and this is where it becomes problematic now.

How does this become problematic?

RETURNN was from the very beginning intended to be a flexible generic framework. And in many aspects, it is that. Having a custom training loop is not there yet but really this is not such a big thing as you maybe imagine currently.

And again, also as said, this can be an optional thing. In RETURNN itself, it has to be optional because we don't want to break anything.

First, there are now two different repositories, returnn and returnn_common, which are extremely complex and need to be kept in sync with each other,

No, not really. First of all, returnn_common is not really so complex. I'm not sure what you mean. If you think something can be simplified, please make a separate issue, and be more specific.

Then, they also don't need to be in sync. RETURNN has the strict principle to stay backward compatible. So for a fixed returnn-common version, you can always update RETURNN to the latest version.

Once we declare the returnn-common API as stable (e.g. for the first release), we also have the same here as well.

If there are any issues about this, also make a separate issue. There should not be the requirement that they are in sync.

So if we were to go with your approach, this would mean that the usage of returnn_common would need to be taken out of the pipelines, and we would just define fixed systems/configs with returnn_common that we pass to jobs in the pipeline (as I did e.g. with the ReturnnTrainingFromFile job to quickly bring your hand-written systems into Sisyphus). While this would work, and would have the advantage that we do not need to care here in returnn_common at all about what happens in our pipelines, this is quite contrary to what we (referring e.g. to @mmz33 and @Atticus1806) expected returnn_common to do.

I don't understand what you mean here. What approach are you referring to? Is this related to this issue? If this is not about this issue here, please open a separate issue. We should not mix up things.

JackTemaki commented 2 years ago

We should not mix up things.

I just did not want to open 10 different issues that are somewhat related to this one, but wanted to put down the thoughts that came up when reading it before I forget about them. I agree that most things are somewhat off-topic. My replies to some of your questions would be far more off-topic now, so I guess it is best if we take this offline (I think @mmz33 already organized a meeting for this).

albertz commented 2 years ago

As we want this to be optional, it can be extended later, so it is not critical for the first release.

albertz commented 1 year ago

As another source of inspiration, see PyTorch Lightning (example). There, your root module is derived from a special class pl.LightningModule, and you override special functions like training_step, which are executed in every training step. So this defines the execution stage. Instead of having the context scopes as suggested initially here, we could have such a root module base class (nn.RootModule?), which would serve as a generic interface.
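
A rough sketch of how that could look (nn.RootModule and the exact hook name are hypothetical here, just mirroring the Lightning pattern):

class MyModel(nn.RootModule):  # hypothetical base class
  def __init__(self):
    super().__init__()
    self.encoder = Encoder(...)  # submodules as usual

  def training_step(self, x, targets):
    # executed in every training step, i.e. this defines the per-step computation
    loss = nn.cross_entropy(self.encoder(x), targets)
    loss.mark_as_loss()

Further hooks for per-epoch logic or initialization could then be added in the same way, which would cover the epoch loop and init contexts discussed above.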