albertz opened 2 years ago
Note on parameters:
Trainable parameters are usually updated by the optimizer for the defined losses. This usually happens via a loop over mini-batches of some dataset (or an infinite data source), where the optimizer performs an update on the parameters at each step.
The custom update discussed here would also be a per-step update (i.e. per mini-batch), but it probably only makes sense for non-trainable parameters.
Parameters (variables) are created via `nn.Parameter` and behave just like layers or layer refs.
Due to possible transformations on parameters (e.g. weight norm, #91), other code might not always get an actual `nn.Parameter` instance but some `nn.Tensor`. However, when you want to do custom updates on a parameter, you have likely marked it as auxiliary, and then it should not have been transformed (I assume...).
So maybe the auxiliary flag means exactly that we (might) have a custom update.
The custom update might be conditional, e.g. only be done in training (#18).
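As a plain-Python sketch of such a conditional update (the `Parameter` class and the update itself are hypothetical stand-ins, not the real API), the update would simply be gated on the train flag:

```python
class Parameter:
    """Hypothetical stand-in for a parameter with an in-place update."""
    def __init__(self, value):
        self.value = value

    def assign_add(self, delta):
        self.value += delta


def maybe_update(p, train_flag):
    """Apply the custom update only in training (cf. the train flag, #18)."""
    if train_flag:
        p.assign_add(1.0)  # placeholder for some custom per-step update


p = Parameter(0.0)
maybe_update(p, train_flag=False)  # eval: no update
maybe_update(p, train_flag=True)   # training: update applied
print(p.value)  # -> 1.0
```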
In the computation graph, when should the update be done? E.g. when other code reads the variable, should we make sure that it was updated before or after? Or would this be defined by the ordering of code / execution, i.e.:

```python
p = nn.Parameter(...)
nn.print(p)    # old value
p.assign(...)  # custom update (draft API)
nn.print(p)    # new value
```
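Under eager/sequential semantics, the ordering would simply be program order. A minimal plain-Python mock (the `Parameter` class here is a hypothetical stand-in, not the real API) behaves exactly like the draft above:

```python
class Parameter:
    """Hypothetical stand-in: sequential/eager semantics."""
    def __init__(self, value):
        self.value = value

    def assign(self, value):
        # In-place mutation; returns None, like the draft API.
        self.value = value


p = Parameter("old")
before = p.value      # read before the assign sees the old value
p.assign("new")       # custom update
after = p.value       # read after the assign sees the new value
print(before, after)  # -> old new
```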
Is this also well-defined when done inside a loop or cond (#24)? As mentioned before, it is probably common to make this dependent on the train flag (#18), but it could also depend on other things.
Should we allow multiple assigns? This might get tricky in combination with loop or cond.
Can an auxiliary parameter also be trainable, i.e. also be updated by some optimizer? Again, the question would be about the order.
`Parameter.assign` would wrap to `tf.assign`. There should also be `Parameter.assign_add`, `assign_sub`, etc.
What would be the return value? Maybe just `None`. This is basically related to the question of when this is actually executed, i.e. how the order is defined.
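The intended semantics of these methods, sketched with a plain-Python stand-in (mirroring `tf.Variable.assign` / `assign_add` / `assign_sub`; the class itself is hypothetical), would be in-place mutation returning `None`:

```python
import numpy as np


class Parameter:
    """Hypothetical stand-in for the proposed assign API."""
    def __init__(self, value):
        self.value = np.asarray(value, dtype=np.float32)

    def assign(self, value):
        self.value = np.asarray(value, dtype=np.float32)  # returns None

    def assign_add(self, delta):
        self.value = self.value + delta

    def assign_sub(self, delta):
        self.value = self.value - delta


p = Parameter([1.0, 2.0])
p.assign_add(1.0)  # -> [2.0, 3.0]
p.assign_sub(0.5)  # -> [1.5, 2.5]
print(p.value)
```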
In case this is used in addition to a loss, the question is in what order we apply both updates, and whether the custom update, however it is calculated, should depend on the gradient update or not. When this is not enforced, the order will be non-deterministic, which is maybe not a good idea.
The `constraint` of a `tf.Variable` might be a nice and clean generic way. It is also used by the optimizer, with the right control dependencies, after the variable was updated by the optimizer. It is formulated as a transformation old -> new, and executed by this TF code:
```python
class _DenseResourceVariableProcessor(_OptimizableVariable):
  ...

  def update_op(self, optimizer, g):
    ...
    update_op = optimizer._resource_apply_dense(g, self._v)
    if self._v.constraint is not None:
      with ops.control_dependencies([update_op]):
        return self._v.assign(self._v.constraint(self._v))
    else:
      return update_op
```
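The same old -> new transformation can be sketched without TF (all names here are illustrative): the optimizer step runs first, and the constraint is applied strictly afterwards, mirroring the `control_dependencies` ordering in the TF code above. A typical constraint would be a max-norm projection:

```python
import numpy as np


def max_norm_constraint(v, max_norm=1.0):
    """Constraint as an old -> new transformation: project onto the max-norm ball."""
    norm = float(np.linalg.norm(v))
    return v * (max_norm / norm) if norm > max_norm else v


def apply_update(v, grad, lr=0.1, constraint=None):
    """Optimizer update first; the constraint runs strictly afterwards."""
    v = v - lr * grad
    if constraint is not None:
        v = constraint(v)
    return v


v = np.array([3.0, 4.0])  # norm 5, outside the unit ball
v = apply_update(v, grad=np.zeros(2), constraint=max_norm_constraint)
print(np.linalg.norm(v))  # projected back to (approximately) norm 1
```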
This is for the case that the optimizer updates the variable. Otherwise, we need to take care of it explicitly.
This is maybe still a bit too restrictive. An explicit `assign_add` or similar might be better.
Also, there are probably different use cases: the custom update might first need to wait for some other variable update, or it must be executed before some other variable is updated. When we don't handle this explicitly, the order will be arbitrary and non-deterministic, which is maybe not a good idea, unless the update is totally independent of all other variables. That might actually be the case for many use cases (e.g. applying weight decay).
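A custom update that depends only on the parameter itself, like plain weight decay, is the unproblematic case: the ordering relative to other variable updates does not matter. A small sketch (names hypothetical):

```python
import numpy as np


def weight_decay_update(value, decay=0.01):
    """Custom per-step update: shrink the parameter toward zero.
    Depends only on the parameter itself, so its ordering relative
    to updates of other variables is irrelevant."""
    return value - decay * value  # equivalent to assign_sub(decay * value)


v = np.array([1.0, -2.0])
v = weight_decay_update(v)
print(v)  # -> [ 0.99 -1.98]
```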
The corresponding issue on the RETURNN side would allow for something like this: https://github.com/rwth-i6/returnn/issues/1214
Common use cases: