keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Keras update after single batch which exceeds the GPU memory #3556

Closed. wx405557858 closed this issue 3 years ago.

wx405557858 commented 8 years ago

Can Keras support updating parameters only after a relatively large batch whose size exceeds GPU memory if fed in at once? Due to 12 GB of GPU memory, my model can currently only be fed batch_size=4 samples at a time. The loss is difficult to decrease with batch_size=4, so I want to update the parameters only once every 32 samples. Will Keras be able to support this? It seems that Caffe can. Thanks!
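
In other words, the request is for gradient accumulation: run several small forward/backward passes, sum the gradients, and only apply a parameter update every N batches. A minimal framework-agnostic sketch of the idea (batches_of_4, loss_fn, compute_gradients and apply_update are hypothetical placeholder names, not a Keras API):

accum_steps = 8                      # 8 micro-batches of 4 ~ effective batch of 32
accum_grads = [0.0 for _ in params]

for step, (x, y) in enumerate(batches_of_4):                  # hypothetical iterator
    grads = compute_gradients(loss_fn(model(x), y), params)   # hypothetical helper
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]
    if (step + 1) % accum_steps == 0:
        # apply the mean of the accumulated gradients, then reset
        apply_update(params, [a / accum_steps for a in accum_grads])
        accum_grads = [0.0 for _ in params]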

alexeydevederkin commented 5 years ago

@phobrain Seems like an issue with the different behavior of the division operator / in Python 2 vs. Python 3.

You could try changing the computation of completed_updates to this:

  completed_updates = K.cast(K.tf.floordiv(self.iterations, self.accum_iters), K.floatx())

Does it work now in python2?
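
For reference, the behavioral difference that causes this (plain Python, runnable under Python 3):

print(7 / 2)    # Python 3: 3.5 (true division); Python 2 would print 3 for ints
print(7 // 2)   # 3 on both versions: floor division, which the counter logic needs
# K.tf.floordiv (or // on tensors) pins down the integer-division behavior
# regardless of the Python version running the code.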

phobrain commented 5 years ago

It works, and epoch 1 positive holdout accuracy is 'normal' (76%; 82% is maybe the highest epoch 1 I've seen).

Again, do these timings make sense? Run time is the same as with plain adagrad. Given my naivete, I wonder if it could be me, but I can't see any way to screw it up. :-) ... aha, unless I'm supposed to double batch_size as well?

adadelta 11/4                 15080/15080 3414s 226ms/step
@Dutil  11/21                 15394/15394 3190s 207ms/step
@alexeydevederkin 11/22       15394/15394 3321s 216ms/step

I let @Dutil's version run a few epochs until the holdout tests got worse, and it didn't make a difference in accuracy range. Since I have a BatchNormalization layer, the results are not rigorous, so I will try without it next (using keyword vectors, since VGG is so slow... which is where I need it in the end, since two 224x224 pics at a time means batch=32). That way I can compare the two AdamAccumulate versions, and for a while stop wondering whether Khashoggi was the reincarnation of Archduke Franz Ferdinand.

Here are the epoch 1 holdout pos/neg results for both versions, same range as different adadelta runs:

0.719105502911 0.885949990459  2018-11-21 15:01:22.287927
0.764707315291 0.816526127326  2018-11-22 11:07:53.761873

Epoch 4 of @Dutil where I bailed:

0.836703125381 0.754312602258  2018-11-21 19:11:43.344608
phobrain commented 5 years ago

Keyword (binary) vector results, same training pairs of pics involved, batch_size=1024, no BatchNormalization except in the 1st set below. 'Epochs' are 3 epochs each, and runs continue until crude criteria are no longer satisfied.

Total params: 4,609,775
Trainable params: 4,609,263
Non-trainable params: 512  [?]

Adadelta w/ BatchNormalization [Epochs/runs restarted due to decrease in holdout accuracy]

All Epochs     :     1     2     1     2     3     4     5     6
Positive Test %: 84.97 85.19 83.47 83.20 82.53 85.81 83.05 86.05
Negative Test %: 81.69 84.02 82.99 86.60 88.15 84.68 88.28 85.12

Adadelta

All Epochs     :     1     2     3     4     5     6     1     2
Positive Test %: 72.35 72.56 72.69 74.34 80.09 79.67 75.86 75.35
Negative Test %: 87.58 89.92 91.05 91.55 87.06 88.16 83.30 88.05

@Dutil

All Epochs     :     1     2     3     4     5     6     7     8
Positive Test %: 78.20 79.76 81.53 78.54 83.01 81.46 81.94 83.86
Negative Test %: 85.26 87.15 86.88 90.56 86.58 88.48 88.16 86.27

@alexeydevederkin

accum_iters=2, batch_size = 1024 (as above cases)

All Epochs     :     1     2     3     4     5     6     7     8
Positive Test %: 76.05 78.60 80.91 81.87 82.90 83.37 83.47 83.48
Negative Test %: 90.04 89.34 88.72 88.27 87.74 87.41 87.39 87.40

accum_iters=2, batch_size = 512

All Epochs     :     1     2     3     4     5     6     7     8
Positive Test %: 75.16 77.79 81.04 80.42 80.43 81.77 82.87 83.42
Negative Test %: 89.18 89.14 87.26 88.26 89.08 87.92 87.30 86.92

accum_iters=3, batch_size = 1024

All Epochs     :     1     2     3     4     5     6     7     8
Positive Test %: 79.94 80.09 80.53 81.46 81.45 84.51 81.67 82.21
Negative Test %: 85.27 87.39 88.40 88.50 88.74 85.50 88.53 88.07

accum_iters=2, batch_size = 1024

491/491 - 21s 42ms/step - loss: 0.5021 - binary_accuracy: 0.7388 - val_loss: 0.4496 - val_binary_accuracy: 0.8027
491/491 - 16s 32ms/step - loss: 0.4110 - binary_accuracy: 0.8136 - val_loss: 0.3912 - val_binary_accuracy: 0.8428
491/491 - 17s 34ms/step - loss: 0.3814 - binary_accuracy: 0.8298 - val_loss: 0.3945 - val_binary_accuracy: 0.8389

accum_iters=2, batch_size=512

982/982 - 29s 30ms/step - loss: 0.3022 - binary_accuracy: 0.8704 - val_loss: 0.3575 - val_binary_accuracy: 0.8398
982/982 - 27s 28ms/step - loss: 0.2999 - binary_accuracy: 0.8717 - val_loss: 0.3411 - val_binary_accuracy: 0.8457
982/982 - 26s 26ms/step - loss: 0.2979 - binary_accuracy: 0.8722 - val_loss: 0.3732 - val_binary_accuracy: 0.8633

accum_iters=3, batch_size = 1024

491/491 - 19s 38ms/step - loss: 0.3044 - binary_accuracy: 0.8691 - val_loss: 0.3166 - val_binary_accuracy: 0.8682
491/491 - 17s 34ms/step - loss: 0.3010 - binary_accuracy: 0.8705 - val_loss: 0.3338 - val_binary_accuracy: 0.8672
491/491 - 16s 32ms/step - loss: 0.2982 - binary_accuracy: 0.8724 - val_loss: 0.4301 - val_binary_accuracy: 0.8281

accum_iters=4, batch_size = 1024

491/491 - 20s 42ms/step - loss: 0.5332 - binary_accuracy: 0.7058 - val_loss: 0.4413 - val_binary_accuracy: 0.8154
491/491 - 16s 33ms/step - loss: 0.4202 - binary_accuracy: 0.8063 - val_loss: 0.4025 - val_binary_accuracy: 0.8232
491/491 - 17s 34ms/step - loss: 0.3837 - binary_accuracy: 0.8277 - val_loss: 0.3498 - val_binary_accuracy: 0.8525

The test case answers my question about batch size: it is reduced by the accum_iters factor:

model_2.fit(train_images, train_labels, epochs=5, batch_size=32,
            accum_iters=8)
...
model_3.fit(train_images, train_labels, epochs=5, batch_size=4,
alexeydevederkin commented 5 years ago

Run time of an optimizer with accumulation should be similar to the run time of the same optimizer without accumulation at the same batch_size (not at the same effective batch size).

For example, the run time of AdamAccumulate(accum_iters=8) with batch_size=4 equals the run time of Adam with batch_size=4, not Adam with batch_size=32: although it behaves like Adam with batch_size=32, it still physically processes batches of size 4.

I would guess that the way we tweak optimizers here won't work with a BatchNormalization layer, since BN statistics are still computed per small batch rather than per effective batch.

phobrain commented 5 years ago

An answer to my naive expectation of a different epoch time is that the same number of cases is being processed either way; the only difference is the accounting. I realized the thing to do is to try batch_size=64 with VGG16, i.e. 2x what I can fit in memory, and, forgetting to re-comment-out BatchNorm, I get

Positive Test %: 72.89 76.88 76.96
Negative Test %: 88.48 87.68 88.78

Retrying w/out BatchNorm.

phobrain commented 5 years ago

batch_size=64, accum_iters=2: one run; positive test always <80%, dropped to the 50s after a few epochs.

batch_size=96, accum_iters=3

Positive Test %: 76.28 75.90 77.85 79.01 80.40 81.54 75.06 => quit
Negative Test %: 86.19 88.94 88.11 88.15 87.59 86.29

batch_size=128, accum_iters=4 [got low memory msgs; OOM failure on higher batch]

Positive Test %: 78.05 79.84 80.39 74.30 => quit
Negative Test %: 85.35 85.81 86.29

Adam w/ BatchNorm: batch_size=128 fits in memory with plain Adam, it turns out.

Positive Test %: 72.68 76.61 78.60 80.91 82.02 79.98
Negative Test %: 88.46 87.12 87.48 86.26 84.13 87.96

lr=0.00125

Positive Test %: 73.83 81.73 74.95 -> quit
Negative Test %: 85.67 81.85

NN's are far more fun than horse racing, because the horses are real. In this case, apparently even snails will do.

Some morbid labeling of keyword-vector-net-generated pairs while waiting makes life bloom anew; pics won't render in Chrome by default, since the site is not https:

https://forums.craigslist.org/?ID=295644868

I suspect the limitations in accuracy depend on the types of per-pic data more than on batch size or net topology, though BatchNorm gives a tantalizing boost to the convergence rate of positive holdouts with the keywords (above), so I'm hoping a leverageable insight will dawn from that. Histograms plus keyword vectors get positive accuracy up to around 92% (faster runs mean a bigger sample), and it seems a convolutional method should get closer to that than ~85%. In the end, I'll mix and match the methods dynamically according to AI personality requirements when interacting.

noamwies commented 5 years ago

here is my solution that works for any optimizer! (with tensorflow backend)

import sys

import tensorflow
from tensorflow.keras import backend as K

def convert_to_accumulate_gradient_optimizer(orig_optimizer, update_params_frequency, accumulate_sum_or_mean=True):
    if update_params_frequency < 1:
        raise ValueError('update_params_frequency must be >= 1')
    print('update_params_frequency: %s' % update_params_frequency)
    print('accumulate_sum_or_mean: %s' % accumulate_sum_or_mean)
    orig_get_gradients = orig_optimizer.get_gradients
    orig_get_updates = orig_optimizer.get_updates
    accumulated_iterations = K.variable(0, dtype='int64', name='accumulated_iterations')
    orig_optimizer.accumulated_iterations = accumulated_iterations

    def updated_get_gradients(self, loss, params):
        return self.accumulate_gradient_accumulators

    def updated_get_updates(self, loss, params):
        self.accumulate_gradient_accumulators = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        updates_accumulated_iterations = K.update_add(accumulated_iterations, 1)
        new_grads = orig_get_gradients(loss, params)
        if not accumulate_sum_or_mean:
            new_grads = [g / K.cast(update_params_frequency, K.dtype(g)) for g in new_grads]
        self.updated_grads = [K.update_add(p, g) for p, g in zip(self.accumulate_gradient_accumulators, new_grads)]
        def update_function():
            with tensorflow.control_dependencies(orig_get_updates(loss, params)):
                reset_grads = [K.update(p, K.zeros(K.int_shape(p), dtype=K.dtype(p))) for p in self.accumulate_gradient_accumulators]
            return tensorflow.group(*(reset_grads + [updates_accumulated_iterations]))
        def just_store_function():
            return tensorflow.group(*[updates_accumulated_iterations])

        update_switch = K.equal((updates_accumulated_iterations) % update_params_frequency, 0)

        with tensorflow.control_dependencies(self.updated_grads):
            self.updates = [K.switch(update_switch, update_function, just_store_function)]
            return self.updates

    orig_optimizer.get_gradients = updated_get_gradients.__get__(orig_optimizer, type(orig_optimizer))
    orig_optimizer.get_updates = updated_get_updates.__get__(orig_optimizer, type(orig_optimizer))
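
A hypothetical usage sketch (model, x_train and y_train are placeholders, not from the original post):

from tensorflow.keras.optimizers import SGD

opt = SGD(lr=0.01)
# update weights once every 8 batches, averaging the accumulated gradients
convert_to_accumulate_gradient_optimizer(opt, update_params_frequency=8,
                                         accumulate_sum_or_mean=False)
model.compile(optimizer=opt, loss='mse')    # behaves like an 8x larger batch
model.fit(x_train, y_train, batch_size=4)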

And simple unit tests

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras import backend as K

import numpy as np
import pytest
import tensorflow as tf

def get_simple_linear_model(orig_optimizer, update_params_frequency, accumulate_sum_or_mean):
    inputs = Input(shape=(1, ), dtype='float32')
    outputs = Dense(1, use_bias=False, kernel_initializer='ones')(inputs)
    model = Model(inputs=inputs, outputs=outputs)
    convert_to_accumulate_gradient_optimizer(orig_optimizer, update_params_frequency=update_params_frequency, 
        accumulate_sum_or_mean=accumulate_sum_or_mean)
    def y_loss(y_true, y_pred):
        return K.mean(y_pred)
    def get_w():
        return model.get_weights()[0][0][0]
    def get_sgd_iteration():
        return orig_optimizer.get_weights()[orig_optimizer.weights.index(orig_optimizer.iterations)]
    model.compile(optimizer=orig_optimizer, loss=y_loss)
    return model, get_w, get_sgd_iteration

def test_update_just_when_need():
    model, get_w, get_sgd_iteration = get_simple_linear_model(SGD(lr=1.0), 2, False)
    w_before_call = get_w() 
    model.fit(x=np.array([[2.0]], dtype=np.float32), y=np.array([[0.0]], dtype=np.float32), batch_size=1)
    w_after_first_call = get_w()
    global_step_after_first_call = get_sgd_iteration()
    model.fit(x=np.array([[3.0]], dtype=np.float32), y=np.array([[0.0]], dtype=np.float32), batch_size=1)
    w_after_second_call = get_w()
    global_step_after_second_call = get_sgd_iteration()
    assert global_step_after_first_call == 0
    assert global_step_after_second_call == 1
    assert w_before_call == 1.0
    assert w_after_first_call == 1.0
    assert w_after_second_call == -1.5

def test_reset_after_update():
    model, get_w, get_sgd_iteration = get_simple_linear_model(SGD(lr=1.0), 1, False)
    model.fit(x=np.array([[2.0]], dtype=np.float32), y=np.array([[0.0]], dtype=np.float32), batch_size=1)
    model.fit(x=np.array([[3.0]], dtype=np.float32), y=np.array([[0.0]], dtype=np.float32), batch_size=1)
    w_after_second_call = get_w()
    assert w_after_second_call == -4.0
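
For reference, the expected values in these tests follow from the setup: the loss is y_loss = mean(y_pred) = w*x, so dL/dw = x. In test_update_just_when_need (update_params_frequency=2, mean accumulation, lr=1.0), the first batch (x=2.0) only stores its gradient, and the second batch (x=3.0) triggers an update with the mean gradient (2.0 + 3.0) / 2 = 2.5, giving w = 1.0 - 2.5 = -1.5. In test_reset_after_update (frequency=1), every batch updates immediately: w = 1 - 2 = -1, then w = -1 - 3 = -4.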
jkjung-avt commented 5 years ago

@noamwies Thanks for sharing the code. I think the following line should be corrected:

updates_accumulated_iterations = K.update_add(accumulated_iterations, 1)

as

updates_accumulated_iterations = K.update_add(self.accumulated_iterations, 1)

bojone commented 5 years ago

My implementation with rewriting optimizer:

https://github.com/bojone/accum_optimizer_for_keras

Pari-singh commented 5 years ago

@alexeydevederkin I am getting the error:

  File "train.py", line 55, in __init__
    super(AdamAccumulate, self).__init__(**kwargs)
  TypeError: __init__() missing 1 required positional argument: 'name'

upon running your code; could you please help me with my problem? I am running it on Python 3.7 and TF2. Also, TF doesn't have the Keras legacy interfaces, so how could we adapt your code for TensorFlow? (I installed Keras just for this optimizer.)

Thanks a lot in advance

ironbar commented 5 years ago

I have the same problem; I'm trying to get a gradient-accumulator optimizer to work with Keras and TF2, without success for the moment.

ghost commented 4 years ago

Hi guys, thanks for the previous code. I have been trying to replicate the same for SGD with Nesterov momentum:

import keras.backend as K
from keras.legacy import interfaces
from keras.optimizers import Optimizer

class SGDAccum(Optimizer):
    """Stochastic gradient descent optimizer.

    Includes support for momentum,
    learning rate decay, and Nesterov momentum.

    # Arguments
        lr: float >= 0. Learning rate.
        momentum: float >= 0. Parameter updates momentum.
        decay: float >= 0. Learning rate decay over each update.
        nesterov: boolean. Whether to apply Nesterov momentum.
    """

    def __init__(self, lr=0.01, momentum=0., decay=0.,
                 nesterov=False, accum_iters=1, **kwargs):
        super(SGDAccum, self).__init__(**kwargs)
        with K.name_scope(self.__class__.__name__):
            self.iterations = K.variable(0, name='iterations')
            self.lr = K.variable(lr, name='lr')
            self.momentum = K.variable(momentum, name='momentum')
            self.decay = K.variable(decay, name='decay')
            self.accum_iters = K.variable(accum_iters)
        self.initial_decay = decay
        self.nesterov = nesterov

    @interfaces.legacy_get_updates_support
    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)
        self.updates = [K.update_add(self.iterations, 1)]

        lr = self.lr
        if self.initial_decay > 0:
            lr *= (1. / (1. + self.decay * K.cast(self.iterations,
                                                  K.dtype(self.decay))))

        accum_switch = K.equal(self.iterations % self.accum_iters, 0)
        accum_switch = K.cast(accum_switch, dtype='float32')

        # momentum
        shapes = [K.int_shape(p) for p in params]
        moments = [K.zeros(shape) for shape in shapes]
        temp_grads = [K.zeros(shape) for shape in shapes]
        self.weights = [self.iterations] + moments
        for p, cg, m, tg in zip(params, grads, moments, temp_grads):
            g = cg + tg
            v = self.momentum * m - (lr * g / self.accum_iters)  # velocity
            self.updates.append(K.update(m, (1 - accum_switch) * m + accum_switch * v))
            self.updates.append(K.update(tg, (1 - accum_switch) * g))

            if self.nesterov:
                new_p = p + self.momentum * v - (lr * g / self.accum_iters)
            else:
                new_p = p + v

            # Apply constraints.
            if getattr(p, 'constraint', None) is not None:
                new_p = p.constraint(new_p)

            self.updates.append(K.update(p, (1 - accum_switch) * p + accum_switch * new_p))
        return self.updates

    def get_config(self):
        config = {'lr': float(K.get_value(self.lr)),
                  'momentum': float(K.get_value(self.momentum)),
                  'decay': float(K.get_value(self.decay)),
                  'nesterov': self.nesterov,
                  'accum_iters': self.accum_iters}
        base_config = super(SGDAccum, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

Can someone please verify that it looks about right?

@viig99 - Upon using the SGDAccum function, I am getting the error: "TypeError: Not JSON Serializable: <tf.Variable 'SGDAccum_4/Variable:0' shape=() dtype=float32_ref>"

Can you suggest what the cause could be?

Thanks
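
A likely cause, inferred from the code above rather than confirmed in the thread: get_config returns self.accum_iters, which is a backend variable, and tf.Variable objects are not JSON serializable. A minimal sketch of a fix:

    def get_config(self):
        config = {'lr': float(K.get_value(self.lr)),
                  'momentum': float(K.get_value(self.momentum)),
                  'decay': float(K.get_value(self.decay)),
                  'nesterov': self.nesterov,
                  # serialize the plain Python value, not the variable itself:
                  'accum_iters': int(K.get_value(self.accum_iters))}
        base_config = super(SGDAccum, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))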

hoangcuong2011 commented 4 years ago

It seems like none of this code can run with tensorflow.keras. I changed the code to work with tf.keras (e.g., changed the imports from keras to tf.keras). The code compiled, but I could not run it properly (it looks like it gets stuck without doing anything). So does anyone know how to do this with tensorflow.keras? I googled and could not find any reference. Thx.

jkjung-avt commented 4 years ago

'tensorflow.python.keras.optimizer_v2.OptimizerV2' was introduced in tensorflow 1.13.

https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/python/keras/optimizer_v2/optimizer_v2.py#L70

The design of 'OptimizerV2' seems to be an overhaul of the original 'Optimizer' class. I think the code snippets above only work for the old 'Optimizer' class, i.e. only for tf.keras optimizers with tensorflow version 1.12 or lower.
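
For TF2 / OptimizerV2, one route that avoids subclassing the optimizer entirely is accumulating gradients in a custom training loop. A minimal sketch, assuming model, loss_fn and a tf.data dataset already exist:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
accum_steps = 8
accum_grads = [tf.Variable(tf.zeros_like(v), trainable=False)
               for v in model.trainable_variables]

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    for a, g in zip(accum_grads, grads):
        a.assign_add(g)
    if (step + 1) % accum_steps == 0:
        # apply the averaged gradient, then reset the accumulators
        optimizer.apply_gradients(
            [(a / accum_steps, v)
             for a, v in zip(accum_grads, model.trainable_variables)])
        for a in accum_grads:
            a.assign(tf.zeros_like(a))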

hoangcuong2011 commented 4 years ago

Thx for the info @jkjung-avt. I am trying to work with OptimizerV2 but it is indeed not easy.

652994331 commented 4 years ago

My version of Adam optimizer with accumulated gradient (slightly different from @Dutil 's - closer results to Adam)

import keras.backend as K
from keras.legacy import interfaces
from keras.optimizers import Optimizer

class AdamAccumulate(Optimizer):

    def __init__(self, lr=0.001, beta_1=0.9, beta_2=0.999,
                 epsilon=None, decay=0., amsgrad=False, accum_iters=1, **kwargs):
        if accum_iters < 1:
            raise ValueError('accum_iters must be >= 1')
        super(AdamAccumulate, self).__init__(**kwargs)
        with K.name_scope(self.__class__.__name__):
            self.iterations = K.variable(0, dtype='int64', name='iterations')
            self.lr = K.variable(lr, name='lr')
            self.beta_1 = K.variable(beta_1, name='beta_1')
            self.beta_2 = K.variable(beta_2, name='beta_2')
            self.decay = K.variable(decay, name='decay')
        if epsilon is None:
            epsilon = K.epsilon()
        self.epsilon = epsilon
        self.initial_decay = decay
        self.amsgrad = amsgrad
        self.accum_iters = K.variable(accum_iters, K.dtype(self.iterations))
        self.accum_iters_float = K.cast(self.accum_iters, K.floatx())

    @interfaces.legacy_get_updates_support
    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)
        self.updates = [K.update_add(self.iterations, 1)]

        lr = self.lr

        completed_updates = K.cast(K.tf.floordiv(self.iterations, self.accum_iters), K.floatx())

        if self.initial_decay > 0:
            lr = lr * (1. / (1. + self.decay * completed_updates))

        t = completed_updates + 1

        lr_t = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) / (1. - K.pow(self.beta_1, t)))

        # self.iterations incremented after processing a batch
        # batch:              1 2 3 4 5 6 7 8 9
        # self.iterations:    0 1 2 3 4 5 6 7 8
        # update_switch = 1:        x       x    (if accum_iters=4)  
        update_switch = K.equal((self.iterations + 1) % self.accum_iters, 0)
        update_switch = K.cast(update_switch, K.floatx())

        ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        gs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]

        if self.amsgrad:
            vhats = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        else:
            vhats = [K.zeros(1) for _ in params]

        self.weights = [self.iterations] + ms + vs + vhats

        for p, g, m, v, vhat, tg in zip(params, grads, ms, vs, vhats, gs):

            sum_grad = tg + g
            avg_grad = sum_grad / self.accum_iters_float

            m_t = (self.beta_1 * m) + (1. - self.beta_1) * avg_grad
            v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(avg_grad)

            if self.amsgrad:
                vhat_t = K.maximum(vhat, v_t)
                p_t = p - lr_t * m_t / (K.sqrt(vhat_t) + self.epsilon)
                self.updates.append(K.update(vhat, (1 - update_switch) * vhat + update_switch * vhat_t))
            else:
                p_t = p - lr_t * m_t / (K.sqrt(v_t) + self.epsilon)

            self.updates.append(K.update(m, (1 - update_switch) * m + update_switch * m_t))
            self.updates.append(K.update(v, (1 - update_switch) * v + update_switch * v_t))
            self.updates.append(K.update(tg, (1 - update_switch) * sum_grad))
            new_p = p_t

            # Apply constraints.
            if getattr(p, 'constraint', None) is not None:
                new_p = p.constraint(new_p)

            self.updates.append(K.update(p, (1 - update_switch) * p + update_switch * new_p))
        return self.updates

    def get_config(self):
        config = {'lr': float(K.get_value(self.lr)),
                  'beta_1': float(K.get_value(self.beta_1)),
                  'beta_2': float(K.get_value(self.beta_2)),
                  'decay': float(K.get_value(self.decay)),
                  'epsilon': self.epsilon,
                  'amsgrad': self.amsgrad}
        base_config = super(AdamAccumulate, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

Tests:

Training with Adam, 1st run:
Epoch 1/5
60000/60000 [==============================] - 68s 1ms/step - loss: 1.3168 - acc: 0.6004
Epoch 2/5
60000/60000 [==============================] - 70s 1ms/step - loss: 0.4745 - acc: 0.8595
Epoch 3/5
60000/60000 [==============================] - 69s 1ms/step - loss: 0.3572 - acc: 0.8944
Epoch 4/5
60000/60000 [==============================] - 71s 1ms/step - loss: 0.3018 - acc: 0.9104
Epoch 5/5
60000/60000 [==============================] - 71s 1ms/step - loss: 0.2672 - acc: 0.9201

Training with Adam, 2nd run:
Epoch 1/5
60000/60000 [==============================] - 71s 1ms/step - loss: 1.3168 - acc: 0.6004
Epoch 2/5
60000/60000 [==============================] - 71s 1ms/step - loss: 0.4745 - acc: 0.8595
Epoch 3/5
60000/60000 [==============================] - 67s 1ms/step - loss: 0.3572 - acc: 0.8944
Epoch 4/5
60000/60000 [==============================] - 71s 1ms/step - loss: 0.3018 - acc: 0.9104
Epoch 5/5
60000/60000 [==============================] - 67s 1ms/step - loss: 0.2672 - acc: 0.9201

Training with AdamAccumulate:
Epoch 1/5
60000/60000 [==============================] - 141s 2ms/step - loss: 1.3167 - acc: 0.6004   
Epoch 2/5
60000/60000 [==============================] - 141s 2ms/step - loss: 0.4744 - acc: 0.8596
Epoch 3/5
60000/60000 [==============================] - 136s 2ms/step - loss: 0.3572 - acc: 0.8944
Epoch 4/5
60000/60000 [==============================] - 139s 2ms/step - loss: 0.3018 - acc: 0.9105
Epoch 5/5
60000/60000 [==============================] - 138s 2ms/step - loss: 0.2671 - acc: 0.9201

I'm not very familiar with Tensorflow, but maybe it could be further improved (for speed) by using conditional updates instead of updating variables with the same values.

Hi, could anyone show how to use this code for BERT fine-tuning? I mean, should I just replace BERT's optimization.py with this, or do something else? Thanks.

hoangcuong2011 commented 4 years ago

@652994331 : Are you able to run your code with tf.keras? I suppose it does not work when converting the code to tf.keras and running it, but please let me know if it is possible on your side. Thx.

bojone commented 4 years ago

@652994331 : Are you able to run your code with tf.keras? I suppose it does not work when converting the code to tf.keras and running it, but please let me know if it is possible on your side. Thx.

Both keras and tf.keras can refer to this: https://github.com/bojone/bert4keras/blob/master/bert4keras/optimizers.py

5hyfilm-zz commented 4 years ago

Has anyone encountered this problem while using AdamAccumulate? TypeError: __init__() missing 1 required positional argument: 'name'

5hyfilm-zz commented 4 years ago

@Pari-singh I encountered this problem and am still stuck on it. Were you able to solve it? If so, please tell me how.

andreped commented 3 years ago

here is my solution that works for any optimizer! (with tensorflow backend)

[full code and unit tests quoted verbatim above]

Did you verify that the implementation works well, in terms of expected performance and runtime? It would be nice to know. Currently, for accumulated gradients I typically modify the train step to handle when and how gradients are updated, but your approach might be more convenient, as it makes it possible to still use model.fit() or model.fit_generator() for the training loop.
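
For reference, the "modify the train step" approach mentioned above can look roughly like the sketch below in TF2 (a hypothetical AccumModel, compiled with run_eagerly=True so the plain Python if works; a graph-mode version would need tf.cond):

import tensorflow as tf

class AccumModel(tf.keras.Model):  # hypothetical name, not from this thread
    def __init__(self, *args, accum_steps=8, **kwargs):
        super().__init__(*args, **kwargs)
        self.accum_steps = accum_steps
        self.accum_counter = 0
        self.accum_grads = None  # created lazily, once weights exist

    def train_step(self, data):
        x, y = data
        if self.accum_grads is None:
            self.accum_grads = [tf.Variable(tf.zeros_like(v), trainable=False)
                                for v in self.trainable_variables]
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred)
        grads = tape.gradient(loss, self.trainable_variables)
        for a, g in zip(self.accum_grads, grads):
            a.assign_add(g)
        self.accum_counter += 1
        if self.accum_counter % self.accum_steps == 0:
            # apply the averaged gradient, then reset the accumulators
            self.optimizer.apply_gradients(
                [(a / self.accum_steps, v)
                 for a, v in zip(self.accum_grads, self.trainable_variables)])
            for a in self.accum_grads:
                a.assign(tf.zeros_like(a))
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

Usage would then be along the lines of model = AccumModel(inputs, outputs, accum_steps=8) followed by model.compile(optimizer='adam', loss='mse', run_eagerly=True), keeping model.fit() as the training loop.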