juntang-zhuang / Adabelief-Optimizer

Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"
BSD 2-Clause "Simplified" License

Tensorflow Implementation #34

Closed ManoharSai2000 closed 3 years ago

ManoharSai2000 commented 3 years ago

When I tried the optimizer with the TensorFlow CycleGAN, it takes a lot of time to complete one step. Is this a problem with the use of the GPU or the framework, or with the optimizer itself?

Thanks in Advance

sumanth-sadu commented 3 years ago

I have the same issue. Can anyone provide some insight?

Thanks

juntang-zhuang commented 3 years ago

The PyTorch version does not have this problem, so I think it's due to the implementation; I'm not so familiar with TensorFlow. The pip package (0.1.0) is an old version compared to the source code under pypi_packages/adabelief_tf0.1.0/adabelief_tf/AdaBelief_tf.py, which was merged from a pull request by @cryu854 and should be better optimized, but I have not updated it on pip yet (so `pip install adabelief-tf` installs an old version). Please try the source code, and let me know if there are any updates.
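For example, one way to try the repository source instead of the pip release is to point Python at the source file directly (a minimal sketch; the clone location and the sys.path trick are just one option, and the package layout is the one referenced later in this thread):

```python
import sys

# Assumes the repository has been cloned locally, e.g.
#   git clone https://github.com/juntang-zhuang/Adabelief-Optimizer.git
sys.path.insert(0, "Adabelief-Optimizer/pypi_packages/adabelief_tf0.1.0/adabelief_tf")

from AdaBelief_tf import AdaBeliefOptimizer  # imports the source file, not the pip package

opt = AdaBeliefOptimizer(learning_rate=1e-3)
```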

juntang-zhuang commented 3 years ago

@ManoharSai2000 @sumanthsadhu Could you provide the code to reproduce the issue?

ManoharSai2000 commented 3 years ago

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf

from tabulate import tabulate
from colorama import Fore, Back, Style


class AdaBeliefOptimizer(tf.keras.optimizers.Optimizer):
    """It implements the AdaBeliefOptimizer proposed by Juntang Zhuang et al. in
    AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients.

    Example of usage:

    from adabelief_tf import AdaBeliefOptimizer
    opt = AdaBeliefOptimizer(lr=1e-3)
Note: `amsgrad` is not described in the original paper. Use it with
caution.

AdaBeliefOptimizer is not a replacement for heuristic warmup; the warmup settings
should be kept if warmup has already been employed and tuned in the baseline method.
You can enable warmup by setting `total_steps` and `warmup_proportion`:
```python
opt = AdaBeliefOptimizer(
    lr=1e-3,
    total_steps=10000,
    warmup_proportion=0.1,
    min_lr=1e-5,
)
```
In the above example, the learning rate will increase linearly
from 0 to `lr` in 1000 steps, then decrease linearly from `lr` to `min_lr`
in 9000 steps.
Lookahead, proposed by Michael R. Zhang et al. in the paper
[Lookahead Optimizer: k steps forward, 1 step back](https://arxiv.org/abs/1907.08610v1),
can be integrated with AdaBeliefOptimizer, as announced by Less Wright; the combined
optimizer is also called "Ranger". The mechanism can be enabled by using the lookahead
wrapper. For example:
```python
adabelief = AdaBeliefOptimizer()
ranger = tfa.optimizers.Lookahead(adabelief, sync_period=6, slow_step_size=0.5)
```
Example of serialization:
```python
optimizer = AdaBeliefOptimizer(learning_rate=lr_scheduler, weight_decay=wd_scheduler)
config = tf.keras.optimizers.serialize(optimizer)
new_optimizer = tf.keras.optimizers.deserialize(config, custom_objects={"AdaBeliefOptimizer": AdaBeliefOptimizer})
```
"""

def __init__(
    self,
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-14,
    weight_decay=0.0,
    rectify=True,
    amsgrad=False,
    sma_threshold=5.0,
    total_steps=0,
    warmup_proportion=0.1,
    min_lr=0.0,
    name="AdaBeliefOptimizer",
    **kwargs):
    r"""Construct a new AdaBelief optimizer.
    Args:
        learning_rate: A `Tensor` or a floating point value, or a schedule
            that is a `tf.keras.optimizers.schedules.LearningRateSchedule`.
            The learning rate.
        beta_1: A float value or a constant float tensor.
            The exponential decay rate for the 1st moment estimates.
        beta_2: A float value or a constant float tensor.
            The exponential decay rate for the 2nd moment estimates.
        epsilon: A small constant for numerical stability.
        weight_decay: A `Tensor` or a floating point value, or a schedule
            that is a `tf.keras.optimizers.schedules.LearningRateSchedule`.
            Weight decay for each parameter.
        rectify: boolean. Whether to enable rectification as in RectifiedAdam.
        amsgrad: boolean. Whether to apply AMSGrad variant of this
            algorithm from the paper "On the Convergence of Adam and
            beyond".
        sma_threshold: A float value.
            The threshold for the simple moving average.
        total_steps: An integer. Total number of training steps.
            Enable warmup by setting a positive value.
        warmup_proportion: A floating point value.
            The proportion of increasing steps.
        min_lr: A floating point value. Minimum learning rate after warmup.
        name: Optional name for the operations created when applying
            gradients. Defaults to "AdaBeliefOptimizer".
        **kwargs: keyword arguments. Allowed to be {`clipnorm`,
            `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients
            by norm; `clipvalue` is clip gradients by value, `decay` is
            included for backward compatibility to allow time inverse
            decay of learning rate. `lr` is included for backward
            compatibility, recommended to use `learning_rate` instead.
    """
    super().__init__(name, **kwargs)

    # ------------------------------------------------------------------------------
    # Print modifications to default arguments
    #print(Fore.RED + 'Please check your arguments if you have upgraded adabelief-tf from version 0.0.1.')
    #print(Fore.RED + 'Modifications to default arguments:')
    #default_table = tabulate([
    #    ['adabelief-tf=0.0.1','1e-8','Not supported','Not supported'],
    #    ['Current version (0.1.0)','1e-14','supported','default: True']],
    #    headers=['eps','weight_decouple','rectify'])
    #print(Fore.RED + default_table)

    #print(Fore.RED +'For a complete table of recommended hyperparameters, see')
    #print(Fore.RED + 'https://github.com/juntang-zhuang/Adabelief-Optimizer')

    print(Style.RESET_ALL)
    # ------------------------------------------------------------------------------

    self._set_hyper("learning_rate", kwargs.get("lr", learning_rate))
    self._set_hyper("beta_1", beta_1)
    self._set_hyper("beta_2", beta_2)
    self._set_hyper("decay", self._initial_decay)
    self._set_hyper("weight_decay", weight_decay)
    self._set_hyper("sma_threshold", sma_threshold)
    self._set_hyper("total_steps", int(total_steps))
    self._set_hyper("warmup_proportion", warmup_proportion)
    self._set_hyper("min_lr", min_lr)
    self.epsilon = epsilon or tf.keras.backend.epsilon()
    self.amsgrad = amsgrad
    self.rectify = rectify
    self._has_weight_decay = weight_decay != 0.0
    self._initial_total_steps = total_steps

def _create_slots(self, var_list):
    for var in var_list:
        self.add_slot(var, "m")
    for var in var_list:
        self.add_slot(var, "v")
    for var in var_list:
        self.add_slot(var, "grad_dif")
    if self.amsgrad:
        for var in var_list:
            self.add_slot(var, "vhat")

def set_weights(self, weights):
    params = self.weights
    num_vars = int((len(params) - 1) / 2)
    if len(weights) == 4 * num_vars + 1:
        weights = weights[: len(params)]
    super().set_weights(weights)

def _decayed_wd(self, var_dtype):
    wd_t = self._get_hyper("weight_decay", var_dtype)
    if isinstance(wd_t, tf.keras.optimizers.schedules.LearningRateSchedule):
        wd_t = tf.cast(wd_t(self.iterations), var_dtype)
    return wd_t

def _resource_apply_dense(self, grad, var):
    var_dtype = var.dtype.base_dtype
    lr_t = self._decayed_lr(var_dtype)
    wd_t = self._decayed_wd(var_dtype)
    m = self.get_slot(var, "m")
    v = self.get_slot(var, "v")
    beta_1_t = self._get_hyper("beta_1", var_dtype)
    beta_2_t = self._get_hyper("beta_2", var_dtype)
    epsilon_t = tf.convert_to_tensor(self.epsilon, var_dtype)
    local_step = tf.cast(self.iterations + 1, var_dtype)
    beta_1_power = tf.math.pow(beta_1_t, local_step)
    beta_2_power = tf.math.pow(beta_2_t, local_step)

    # optional warmup: lr increases linearly over the first warmup_proportion of
    # total_steps, then decays linearly towards min_lr
    if self._initial_total_steps > 0:
        total_steps = self._get_hyper("total_steps", var_dtype)
        warmup_steps = total_steps * self._get_hyper("warmup_proportion", var_dtype)
        min_lr = self._get_hyper("min_lr", var_dtype)
        decay_steps = tf.maximum(total_steps - warmup_steps, 1)
        decay_rate = (min_lr - lr_t) / decay_steps
        lr_t = tf.where(
            local_step <= warmup_steps,
            lr_t * (local_step / warmup_steps),
            lr_t + decay_rate * tf.minimum(local_step - warmup_steps, decay_steps),
        )

    sma_inf = 2.0 / (1.0 - beta_2_t) - 1.0
    sma_t = sma_inf - 2.0 * local_step * beta_2_power / (1.0 - beta_2_power)

    m_t = m.assign(
        beta_1_t * m + (1.0 - beta_1_t) * grad, use_locking=self._use_locking
    )
    m_corr_t = m_t / (1.0 - beta_1_power)

    # the 'grad_dif' slot is assigned here but never read again in this method
    grad_dif = self.get_slot(var, "grad_dif")
    grad_dif.assign(grad - m_t)
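    # unlike Adam, the second moment is an EMA of (grad - m_t)^2, i.e. the squared deviation
    # of the gradient from its EMA ('belief'), rather than an EMA of grad^2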
    v_t = v.assign(
        beta_2_t * v + (1.0 - beta_2_t) * tf.math.square(grad - m_t) + epsilon_t,
        use_locking=self._use_locking,
    )

    if self.amsgrad:
        vhat = self.get_slot(var, "vhat")
        vhat_t = vhat.assign(tf.maximum(vhat, v_t), use_locking=self._use_locking)
        v_corr_t = tf.math.sqrt(vhat_t / (1.0 - beta_2_power))
    else:
        vhat_t = None
        v_corr_t = tf.math.sqrt(v_t / (1.0 - beta_2_power))

    r_t = tf.math.sqrt(
        (sma_t - 4.0)
        / (sma_inf - 4.0)
        * (sma_t - 2.0)
        / (sma_inf - 2.0)
        * sma_inf
        / sma_t
    )

    # rectification as in RectifiedAdam: apply the variance rectification term r_t only
    # when the approximated SMA length sma_t is above sma_threshold, otherwise use the
    # bias-corrected first moment alone
    if self.rectify:
        sma_threshold = self._get_hyper("sma_threshold", var_dtype)
        var_t = tf.where(
            sma_t >= sma_threshold,
            r_t * m_corr_t / (v_corr_t + epsilon_t),
            m_corr_t,
        )
    else:
        var_t = m_corr_t / (v_corr_t + epsilon_t)

    # decoupled weight decay: wd_t * var is added to the update itself rather than to the gradient
    if self._has_weight_decay:
        var_t += wd_t * var

    var_update = var.assign_sub(lr_t * var_t, use_locking=self._use_locking)

    updates = [var_update, m_t, v_t]
    if self.amsgrad:
        updates.append(vhat_t)
    return tf.group(*updates)

def _resource_apply_sparse(self, grad, var, indices):
    var_dtype = var.dtype.base_dtype
    lr_t = self._decayed_lr(var_dtype)
    wd_t = self._decayed_wd(var_dtype)
    beta_1_t = self._get_hyper("beta_1", var_dtype)
    beta_2_t = self._get_hyper("beta_2", var_dtype)
    epsilon_t = tf.convert_to_tensor(self.epsilon, var_dtype)
    local_step = tf.cast(self.iterations + 1, var_dtype)
    beta_1_power = tf.math.pow(beta_1_t, local_step)
    beta_2_power = tf.math.pow(beta_2_t, local_step)

    if self._initial_total_steps > 0:
        total_steps = self._get_hyper("total_steps", var_dtype)
        warmup_steps = total_steps * self._get_hyper("warmup_proportion", var_dtype)
        min_lr = self._get_hyper("min_lr", var_dtype)
        decay_steps = tf.maximum(total_steps - warmup_steps, 1)
        decay_rate = (min_lr - lr_t) / decay_steps
        lr_t = tf.where(
            local_step <= warmup_steps,
            lr_t * (local_step / warmup_steps),
            lr_t + decay_rate * tf.minimum(local_step - warmup_steps, decay_steps),
        )

    sma_inf = 2.0 / (1.0 - beta_2_t) - 1.0
    sma_t = sma_inf - 2.0 * local_step * beta_2_power / (1.0 - beta_2_power)

    m = self.get_slot(var, "m")
    m_scaled_g_values = grad * (1 - beta_1_t)
    m_t = m.assign(m * beta_1_t, use_locking=self._use_locking)
    m_t = self._resource_scatter_add(m, indices, m_scaled_g_values)
    m_corr_t = m_t / (1.0 - beta_1_power)

    grad_dif = self.get_slot(var, "grad_dif")
    grad_dif.assign(m_t)
    grad_dif = self._resource_scatter_add(grad_dif, indices, -1.0 * grad)

    v = self.get_slot(var, "v")
    m_t_indices = tf.gather(m_t, indices)
    v_scaled_g_values = tf.math.square(grad - m_t_indices) * (1 - beta_2_t)
    v_t = v.assign(v * beta_2_t + epsilon_t, use_locking=self._use_locking)
    v_t = self._resource_scatter_add(v, indices, v_scaled_g_values)

    if self.amsgrad:
        vhat = self.get_slot(var, "vhat")
        vhat_t = vhat.assign(tf.maximum(vhat, v_t), use_locking=self._use_locking)
        v_corr_t = tf.math.sqrt(vhat_t / (1.0 - beta_2_power))
    else:
        vhat_t = None
        v_corr_t = tf.math.sqrt(v_t / (1.0 - beta_2_power))

    r_t = tf.math.sqrt(
        (sma_t - 4.0)
        / (sma_inf - 4.0)
        * (sma_t - 2.0)
        / (sma_inf - 2.0)
        * sma_inf
        / sma_t
    )

    if self.rectify:
        sma_threshold = self._get_hyper("sma_threshold", var_dtype)
        var_t = tf.where(
            sma_t >= sma_threshold,
            r_t * m_corr_t / (v_corr_t + epsilon_t),
            m_corr_t,
        )
    else:
        var_t = m_corr_t / (v_corr_t + epsilon_t)

    if self._has_weight_decay:
        var_t += wd_t * var

    var_update = self._resource_scatter_add(
        var, indices, tf.gather(-lr_t * var_t, indices)
    )

    updates = [var_update, m_t, v_t]
    if self.amsgrad:
        updates.append(vhat_t)
    return tf.group(*updates)

def get_config(self):
    config = super().get_config()
    config.update(
        {
            "learning_rate": self._serialize_hyperparameter("learning_rate"),
            "beta_1": self._serialize_hyperparameter("beta_1"),
            "beta_2": self._serialize_hyperparameter("beta_2"),
            "decay": self._serialize_hyperparameter("decay"),
            "weight_decay": self._serialize_hyperparameter("weight_decay"),
            "sma_threshold": self._serialize_hyperparameter("sma_threshold"),
            "epsilon": self.epsilon,
            "amsgrad": self.amsgrad,
            "rectify": self.rectify,
            "total_steps": self._serialize_hyperparameter("total_steps"),
            "warmup_proportion": self._serialize_hyperparameter(
                "warmup_proportion"
            ),
            "min_lr": self._serialize_hyperparameter("min_lr"),
        }
    )
    return config

generator_g_optimizer = AdaBeliefOptimizer(2e-4, beta_1=0.5)
generator_f_optimizer = AdaBeliefOptimizer(2e-4, beta_1=0.5)

discriminator_x_optimizer = AdaBeliefOptimizer(2e-4, beta_1=0.5)
discriminator_y_optimizer = AdaBeliefOptimizer(2e-4, beta_1=0.5)

ManoharSai2000 commented 3 years ago

The first part is the AdaBelief source code, and the second part is its usage.

cryu854 commented 3 years ago

Hi @ManoharSai2000, @sumanthsadhu. I tried the TensorFlow CycleGAN example here and trained with both Adam and AdaBelief; one epoch took about 490 s and 560 s respectively on a Tesla T4 in Google Colab. In my opinion, since Adam is further optimized with fused kernels while AdaBelief is implemented purely in TensorFlow ops and then wrapped as a Keras optimizer, some efficiency gap is inevitable.
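If it helps, here is a rough way to compare the per-step overhead of the two optimizers outside the full CycleGAN (a minimal sketch: the toy dense model, random batch, and step count are my own stand-ins, not the Colab setup used for the numbers above):

```python
import time

import tensorflow as tf
from adabelief_tf import AdaBeliefOptimizer

# Toy stand-ins for the CycleGAN model/data; only meant to expose per-step optimizer overhead.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10),
])
x = tf.random.normal([64, 128])
y = tf.random.uniform([64], maxval=10, dtype=tf.int32)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)


def time_steps(optimizer, steps=200):
    @tf.function
    def train_step():
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    train_step()  # first call traces the graph and creates slots; exclude it from the timing
    start = time.perf_counter()
    for _ in range(steps):
        train_step()
    return time.perf_counter() - start


print("Adam      :", time_steps(tf.keras.optimizers.Adam(2e-4, beta_1=0.5)), "s")
print("AdaBelief :", time_steps(AdaBeliefOptimizer(learning_rate=2e-4, beta_1=0.5)), "s")
```

The absolute numbers will differ from the T4 timings above, but the relative gap should roughly reflect the overhead of the wrapped Keras implementation versus Adam's fused kernel.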

ManoharSai2000 commented 3 years ago

@cryu854 OK, thank you. Is the issue the same in PyTorch? I hope the optimizer will be added to TensorFlow soon, as you mentioned.

cryu854 commented 3 years ago

@ManoharSai2000 To my knowledge, PyTorch seems to need an additional compiler like JIT to fuse kernels automatically; otherwise, PyTorch launches a separate kernel for each operation. Please correct me if I'm wrong.

juntang-zhuang commented 3 years ago

@cryu854 Thanks for the update. I just found that the code creates a slot called "grad_dif" that does not seem to be used (https://github.com/juntang-zhuang/Adabelief-Optimizer/blob/f870a0139bd145cbc788f5d9f7a66006ad51c550/pypi_packages/adabelief_tf0.1.0/adabelief_tf/AdaBelief_tf.py#L284), which might also add some computational overhead. So I updated a new version on a branch called "update_0.2.0"; please see the new code at https://github.com/juntang-zhuang/Adabelief-Optimizer/blob/ce188ee2d8c8afc72810374a0fbbe7309f9658f9/pypi_packages/adabelief_tf0.2.0/adabelief_tf/AdaBelief_tf.py. Other updates include an option to turn the red warning messages on or off. Could you perform a quick check or test, perhaps with a tool such as text-compare.com to better identify the exact changes? If everything works fine, we can push it to pip. Thanks a lot in advance.

cryu854 commented 3 years ago

@juntang-zhuang The new code looks good to me, and it passes all the test cases in Adabelief_test.py. By the way, should we move Adabelief_test.py out of the folder? I am not sure whether the PyPI package will include the test code.

juntang-zhuang commented 3 years ago

@cryu854 Thanks a lot. I just deleted the test code, and uploaded it to pip 0.2.0. BTW, do you want to add your name and email at the beginning of the file as a contributor? If so, I'll update it in version 0.2.1. Thanks again for your efforts and help.

juntang-zhuang commented 3 years ago

@ManoharSai2000 @sumanthsadhu I just removed some redundant computation in the code and released adabelief-tf==0.2.0; please try it with `pip install adabelief-tf==0.2.0`. It should be a little faster now, though we did not implement fused kernel operations. The source code is in pypi_packages/adabelief_tf0.2.0.
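For reference, switching to the new release should look like this (assuming the import path is unchanged from 0.1.0):

```python
# pip install adabelief-tf==0.2.0
from adabelief_tf import AdaBeliefOptimizer

optimizer = AdaBeliefOptimizer(learning_rate=2e-4, beta_1=0.5)
```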

cryu854 commented 3 years ago

@juntang-zhuang Yes, if it won't bother you, it would be an honor to be listed as a contributor. Thank you in advance.

juntang-zhuang commented 3 years ago

@cryu854 Could you start a pull request and add your contact info in the code? Perhaps also bump the version to 0.2.1 so that I can upload it to pip. Thanks a lot.