miguelmartin75 commented 7 years ago

I've looked through the documentation and can't seem to find anything equivalent to caffe's lr_mult and decay_mult. My assumption is this is not supported/implemented. Is it possible to add this feature?

In case you don't know what the feature is: for each layer you can supply lr_mult and decay_mult, which are learning rate and weight decay multipliers applied to that layer's kernel and bias weights. For example, in AlexNet each convolutional layer has two lr_mult/decay_mult pairs: the first is applied to the weights and the second to the bias.
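As a rough sketch of the arithmetic, assuming Caffe's usual semantics (a parameter's effective learning rate is the solver's base rate times lr_mult, and its effective weight decay is the solver's decay times decay_mult). The helper name effective_rates and all numeric values below are made up for illustration:

```python
def effective_rates(base_lr, weight_decay, lr_mult=1.0, decay_mult=1.0):
    """Return (effective learning rate, effective weight decay) for one parameter."""
    return base_lr * lr_mult, weight_decay * decay_mult

# Illustrative solver settings (assumed values, not taken from this thread).
base_lr, weight_decay = 0.01, 0.0005

# Kernel weights: multipliers of 1 leave the solver settings unchanged.
kernel_lr, kernel_decay = effective_rates(base_lr, weight_decay,
                                          lr_mult=1, decay_mult=1)

# Bias: twice the learning rate and no weight decay.
bias_lr, bias_decay = effective_rates(base_lr, weight_decay,
                                      lr_mult=2, decay_mult=0)
```

So with these assumed settings the kernel trains at the solver's rate while the bias trains twice as fast and is not regularised.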


There is a pull request open here:

That PR and the next one linked seemed to be closed. I suppose I could try to update #3004 such that it is 2.0 compatible.

I'd be very thankful. I need this feature to reproduce the results of a paper that uses Caffe. I'm pretty sure I'm not the only one who'd like to be able to easily go from Caffe to keras.

Up. I also need this very important feature.

@miguelmartin75 any updates?

Any update on this? I have found this which a few people claim works.

I think a temporary way to do this is to modify your optimizer, i.e. copy the original keras code of optimizers, and replace every lr with your own definition.

For example, to train the last layer with SGD at lr=0.01 and all other layers at lr*0.1=0.001: first, copy the code from keras.optimizers.SGD and define a new optimizer, MultiSGD. Make two changes:

  1. In __init__, add a list exception_vars and a multiplier=0.1 to the arguments. The multiplier is not applied to variables in the list.
  2. In get_updates(), add a new line at the beginning of the loop: multiplied_lr = lr if p in self.exception_vars else lr * self.multiplier. Then, in each line where lr is used to calculate updates, i.e. v = self.momentum * m - lr * g and new_p = p + self.momentum * v - lr * g, replace lr with multiplied_lr.
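The arithmetic behind these two changes can be sketched without Keras as a plain-Python momentum step. This is only an illustration, not the Keras code: the function multi_sgd_step and its dict-based arguments are made up for the example, and it assumes the convention of the worked example (variables in exception_vars keep lr, everything else gets lr * multiplier).

```python
def multi_sgd_step(params, grads, moments, lr=0.01, momentum=0.0,
                   multiplier=0.1, exception_vars=()):
    """One non-Nesterov SGD step with a per-variable learning-rate multiplier.

    params, grads and moments are dicts keyed by variable name (a stand-in
    for the Keras variables); the updated params and moments are returned.
    """
    new_params, new_moments = {}, {}
    for name, p in params.items():
        # Variables listed in exception_vars keep the full learning rate.
        multiplied_lr = lr if name in exception_vars else lr * multiplier
        v = momentum * moments[name] - multiplied_lr * grads[name]  # velocity
        new_moments[name] = v
        new_params[name] = p + v
    return new_params, new_moments


# Two scalar "variables" with identical gradients, to make the effect visible.
params = {"conv1": 1.0, "prediction": 1.0}
grads = {"conv1": 1.0, "prediction": 1.0}
moments = {"conv1": 0.0, "prediction": 0.0}

new_params, new_moments = multi_sgd_step(
    params, grads, moments, lr=0.01, multiplier=0.1,
    exception_vars=["prediction"])
# "prediction" moves with lr=0.01, "conv1" with lr*0.1=0.001.
```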

Second, before compiling your model, enumerate variables in each layer:

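last_layer_variables = list()
for layer in model.layers:
    if layer.name in ['prediction']:
        # collect every variable (kernel, bias, ...) of the selected layer
        last_layer_variables.extend(layer.weights)
multisgd = MultiSGD(....exception_vars=last_layer_variables, multiplier=0.1)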

Then you can use multisgd to compile your model just the same way you use other optimizers.
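The variable-collection step could also be wrapped in a small helper. A minimal sketch, relying only on layers exposing .name and .weights as Keras layers do; the helper name collect_layer_weights and the stand-in layer objects are hypothetical:

```python
from types import SimpleNamespace


def collect_layer_weights(layers, layer_names):
    """Gather the weights of every layer whose name is in layer_names."""
    selected = []
    for layer in layers:
        if layer.name in layer_names:
            selected.extend(layer.weights)
    return selected


# Stand-in layers for illustration; with a real model, pass model.layers.
layers = [
    SimpleNamespace(name="conv1",
                    weights=["conv1/kernel", "conv1/bias"]),
    SimpleNamespace(name="prediction",
                    weights=["prediction/kernel", "prediction/bias"]),
]
last_layer_variables = collect_layer_weights(layers, ["prediction"])
```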

This is just an example. You can modify other optimizers or apply more complicated multipliers in similar ways. I am not sure if this is 100% correct, but it works perfectly on my computer.

I followed @zhenbangchen's solution and it works. Sample script here:

+1 any updates on this?

+1 Please update this feature.

Any updates on this issue?

Any updates on the pull request above?

Hey @miguelmartin75, I have tried to convert the copied MultiSGD to be compatible with tensorflow.keras, but I got an error:

Error while reading resource variable training/MultiSGD/Variable_180 from Container: localhost. This could mean that the variable was uninitialized

from tensorflow.python.keras.optimizers import Optimizer
from tensorflow.python.keras import backend as K
from tensorflow.python.ops import state_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.util.tf_export import tf_export

class MultiSGD(Optimizer):
    """Modified SGD with added support for learning multiplier for kernels and biases
    taken from

    Stochastic gradient descent optimizer.
    Includes support for momentum,
    learning rate decay, and Nesterov momentum.

    Arguments:
        lr: float >= 0. Learning rate.
        momentum: float >= 0. Parameter updates momentum.
        decay: float >= 0. Learning rate decay over each update.
        nesterov: boolean. Whether to apply Nesterov momentum.
    """

    def __init__(self, lr=0.01, momentum=0., decay=0.,
                 nesterov=False, lr_mult=None, **kwargs):
        super(MultiSGD, self).__init__(**kwargs)
        with K.name_scope(self.__class__.__name__):
            self.iterations = K.variable(0, dtype='int64', name='iterations')
            self.lr = K.variable(lr, name='lr')
            self.momentum = K.variable(momentum, name='momentum')
            self.decay = K.variable(decay, name='decay')
        self.initial_decay = decay
        self.nesterov = nesterov
        self.lr_mult = lr_mult

    # @interfaces.legacy_get_updates_support
    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)
        # self.updates = [K.update_add(self.iterations, 1)]
        self.updates = [state_ops.assign_add(self.iterations, 1)]

        lr = self.lr
        if self.initial_decay > 0:
            # lr *= (1. / (1. + self.decay * K.cast(self.iterations,
            #                                       K.dtype(self.decay))))
            lr = lr * (  # pylint: disable=g-no-augmented-assignment
                    1. / (1. + self.decay * math_ops.cast(self.iterations,
                                                          K.dtype(self.decay))))
        # momentum
        shapes = [K.int_shape(p) for p in params]
        moments = [K.zeros(shape) for shape in shapes]
        self.weights = [self.iterations] + moments
        for p, g, m in zip(params, grads, moments):

            if p.name in self.lr_mult:
                multiplied_lr = lr * self.lr_mult[p.name]
            else:
                multiplied_lr = lr

            v = self.momentum * m - multiplied_lr * g  # velocity
            # self.updates.append(K.update(m, v))
            self.updates.append(state_ops.assign(m, v))

            if self.nesterov:
                new_p = p + self.momentum * v - multiplied_lr * g
            else:
                new_p = p + v

            # Apply constraints.
            if getattr(p, 'constraint', None) is not None:
                new_p = p.constraint(new_p)

            # self.updates.append(K.update(p, new_p))
            self.updates.append(state_ops.assign(p, new_p))
        return self.updates

    def get_config(self):
        config = {'lr': float(K.get_value(self.lr)),
                  'momentum': float(K.get_value(self.momentum)),
                  'decay': float(K.get_value(self.decay)),
                  'nesterov': self.nesterov}
        base_config = super(MultiSGD, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))