albertz opened 2 years ago
Note on parameters:
Trainable parameters are usually updated by the optimizer for the defined losses. This usually happens via a loop over mini-batches of some dataset (or an infinite data source), where the optimizer performs an update on the parameters at each step.
The custom update discussed here would also be a per-step update (i.e. per mini-batch), but it probably only makes sense for non-trainable parameters.
Parameters (variables) are created via `nn.Parameter` and behave just like layers or layer refs.
Due to possible transformations on parameters (e.g. weight norm, #91), other code might not always get an actual `nn.Parameter` instance but some `nn.Tensor`. However, when you want to do custom updates on a parameter, you have likely marked it as auxiliary, and then it should not have been transformed (I assume...).
So maybe the auxiliary flag means exactly that we (might) have a custom update.
The custom update might be conditional, e.g. only be done in training (#18).
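As a plain-Python sketch of such a conditional update (the `Parameter` class and the update itself are hypothetical stand-ins, not the real API), the update would simply be gated on the train flag:

```python
class Parameter:
    """Hypothetical stand-in for a parameter with an in-place update."""
    def __init__(self, value):
        self.value = value

    def assign_add(self, delta):
        self.value += delta


def maybe_update(p, train_flag):
    """Apply the custom update only in training (cf. the train flag, #18)."""
    if train_flag:
        p.assign_add(1.0)  # placeholder for some custom per-step update


p = Parameter(0.0)
maybe_update(p, train_flag=False)  # eval: no update
maybe_update(p, train_flag=True)   # training: update applied
print(p.value)  # -> 1.0
```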
In the computation graph, when should the update be done? E.g. when other code reads the variable, should we make sure that it was updated before or after? Or would this be defined by the ordering of code / execution, i.e.:

```python
p = nn.Parameter(...)
nn.print(p)    # old value
p.assign(...)  # custom update (draft API)
nn.print(p)    # new value
```
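Under eager/sequential semantics, the ordering would simply be program order. A minimal plain-Python mock (the `Parameter` class here is a hypothetical stand-in, not the real API) behaves exactly like the draft above:

```python
class Parameter:
    """Hypothetical stand-in: sequential/eager semantics."""
    def __init__(self, value):
        self.value = value

    def assign(self, value):
        # In-place mutation; returns None, like the draft API.
        self.value = value


p = Parameter("old")
before = p.value      # read before the assign sees the old value
p.assign("new")       # custom update
after = p.value       # read after the assign sees the new value
print(before, after)  # -> old new
```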
Is this also well-defined when done inside a loop or cond (#24)? As mentioned before, it is probably common to make this dependent on the train flag (#18), but it could also depend on other things.
Should we allow multiple assigns? This might get tricky in combination with loop or cond.
Can an auxiliary parameter also be trainable, i.e. also be updated by some optimizer? Again, the question would be about the order.
`Parameter.assign` would wrap to `tf.assign`. There should also be `Parameter.assign_add`, `assign_sub`, etc.
What would be the return value? Maybe just `None`. This is basically related to the question of when this is actually executed, i.e. how the order is defined.
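The intended semantics of these methods, sketched with a plain-Python stand-in (mirroring `tf.Variable.assign` / `assign_add` / `assign_sub`; the class itself is hypothetical), would be in-place mutation returning `None`:

```python
import numpy as np


class Parameter:
    """Hypothetical stand-in for the proposed assign API."""
    def __init__(self, value):
        self.value = np.asarray(value, dtype=np.float32)

    def assign(self, value):
        self.value = np.asarray(value, dtype=np.float32)  # returns None

    def assign_add(self, delta):
        self.value = self.value + delta

    def assign_sub(self, delta):
        self.value = self.value - delta


p = Parameter([1.0, 2.0])
p.assign_add(1.0)  # -> [2.0, 3.0]
p.assign_sub(0.5)  # -> [1.5, 2.5]
print(p.value)
```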
In case this is used in addition to a loss, the question is in what order we apply both updates, and whether the custom update, however it is calculated, should depend on the gradient update or not. When this is not enforced, the order will be non-deterministic, which is maybe not a good idea.
The `constraint` of a `tf.Variable` might be a nice and clean generic way. It is also used by the optimizer, with the right control dependencies, after the variable was updated by the optimizer. It is formulated as a transformation old -> new, and executed by this TF code:
```python
class _DenseResourceVariableProcessor(_OptimizableVariable):
  ...

  def update_op(self, optimizer, g):
    ...
    update_op = optimizer._resource_apply_dense(g, self._v)
    if self._v.constraint is not None:
      with ops.control_dependencies([update_op]):
        return self._v.assign(self._v.constraint(self._v))
    else:
      return update_op
```
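The same old -> new transformation can be sketched without TF (all names here are illustrative): the optimizer step runs first, and the constraint is applied strictly afterwards, mirroring the `control_dependencies` ordering in the TF code above. A typical constraint would be a max-norm projection:

```python
import numpy as np


def max_norm_constraint(v, max_norm=1.0):
    """Constraint as an old -> new transformation: project onto the max-norm ball."""
    norm = float(np.linalg.norm(v))
    return v * (max_norm / norm) if norm > max_norm else v


def apply_update(v, grad, lr=0.1, constraint=None):
    """Optimizer update first; the constraint runs strictly afterwards."""
    v = v - lr * grad
    if constraint is not None:
        v = constraint(v)
    return v


v = np.array([3.0, 4.0])  # norm 5, outside the unit ball
v = apply_update(v, grad=np.zeros(2), constraint=max_norm_constraint)
print(np.linalg.norm(v))  # projected back to (approximately) norm 1
```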
This is for the case that the optimizer updates the variable. Otherwise, we need to take care of it explicitly.
This is maybe still a bit too restrictive. An explicit `assign_add` or similar might be better.
Also, there are probably different use cases: the custom update might first need to wait for some other variable update, or it must be executed before some other variable is updated. When we don't handle this explicitly, the order will be arbitrary and non-deterministic, which is maybe not a good idea, unless the update is totally independent of all other variables. That might actually be the case for many use cases (e.g. applying weight decay).
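A custom update that depends only on the parameter itself, like plain weight decay, is the unproblematic case: the ordering relative to other variable updates does not matter. A small sketch (names hypothetical):

```python
import numpy as np


def weight_decay_update(value, decay=0.01):
    """Custom per-step update: shrink the parameter toward zero.
    Depends only on the parameter itself, so its ordering relative
    to updates of other variables is irrelevant."""
    return value - decay * value  # equivalent to assign_sub(decay * value)


v = np.array([1.0, -2.0])
v = weight_decay_update(v)
print(v)  # -> [ 0.99 -1.98]
```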
The corresponding issue on the RETURNN side would allow for something like this: https://github.com/rwth-i6/returnn/issues/1214
Common use cases: