Closed wuliytTaotao closed 4 years ago
cc code owner: @PhilJd
When I run the above code on the CPU only, no error is reported. But another problem arises: learning rate decay and weight decay do not work.
I found that when using model.fit(), tf.optimizers.schedules.PiecewiseConstantDecay should be used as a parameter to learning_rate like below:
schedule = tf.optimizers.schedules.PiecewiseConstantDecay(
[1407*20, 1407*30], [1e-3, 1e-4, 1e-5])
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
model.compile(optimizer=optimizer,
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=40, validation_split=0.1)
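The boundaries passed to PiecewiseConstantDecay are counted in optimizer steps, not epochs, so it may help to see the lookup written out. The following is a plain-Python stand-in, not TensorFlow code: the helper name `piecewise_constant` is hypothetical, and the 1407 steps per epoch is assumed to come from 45,000 training samples (after validation_split=0.1) with the default batch size of 32.

```python
import bisect
import math

# Assumption: 50,000 CIFAR-10 images, 10% held out, default batch size 32.
STEPS_PER_EPOCH = math.ceil(45000 / 32)  # 1407


def piecewise_constant(step, boundaries, values):
    """Return values[i] for the interval that contains `step`.

    values[0] applies while step <= boundaries[0], values[1] while
    boundaries[0] < step <= boundaries[1], and so on.
    """
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have one more entry than boundaries")
    return values[bisect.bisect_left(boundaries, step)]


boundaries = [STEPS_PER_EPOCH * 20, STEPS_PER_EPOCH * 30]
values = [1e-3, 1e-4, 1e-5]

for epoch in (0, 19, 20, 29, 30, 39):
    # Look at the last step of each epoch.
    step = (epoch + 1) * STEPS_PER_EPOCH
    print(epoch, step, piecewise_constant(step, boundaries, values))
```

With these boundaries the rate stays at 1e-3 up to and including step 28140 and drops to 1e-4 on the next step.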
So I tried to use AdamW as well. Learning rate decay works, but the weight decay doesn't:
step = tf.Variable(0, trainable=False)
schedule = tf.optimizers.schedules.PiecewiseConstantDecay(
[1407*20, 1407*30], [1e-3, 1e-4, 1e-5])
wd = lambda: 1e-1 * schedule(step)
# weight decay cannot be changed with schedule
optimizer = tf.keras.optimizers.AdamW(learning_rate=schedule, weight_decay=wd)
model.compile(optimizer=optimizer,
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=40, validation_split=0.1)
When the weight decay does not follow the learning rate schedule, the learning curve may look like this:
Hope someone can tell me how to do it right, thanks!
It seems like keras treats instances of learning_rate_schedule.LearningRateSchedule separately (in _get_hyper).
Could you try to create a second schedule and see if that works? I.e., something along the lines:
schedule_lr = tf.optimizers.schedules.PiecewiseConstantDecay(
[1407*20, 1407*30], [1e-3, 1e-4, 1e-5])
schedule_wd = tf.optimizers.schedules.PiecewiseConstantDecay(
[1407*20, 1407*30], [1e-4, 1e-5, 1e-6])
optimizer = tf.keras.optimizers.AdamW(learning_rate=schedule_lr, weight_decay=schedule_wd)
Thanks :)
It doesn't work:
Traceback (most recent call last):
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/tensor_util.py", line 324, in _AssertCompatible
fn(values)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/tensor_util.py", line 263, in inner
_ = [_check_failed(v) for v in nest.flatten(values)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/tensor_util.py", line 264, in <listcomp>
if not isinstance(v, expected_types)]
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/tensor_util.py", line 248, in _check_failed
raise ValueError(v)
ValueError: <tensorflow.python.keras.optimizer_v2.learning_rate_schedule.PiecewiseConstantDecay object at 0x7f57d72eef60>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tmp2.py", line 48, in <module>
model.fit(x_train, y_train, epochs=40, validation_split=0.1)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
use_multiprocessing=use_multiprocessing)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 324, in fit
total_epochs=epochs)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
batch_outs = execution_function(iterator)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
distributed_function(input_fn))
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
result = self._call(*args, **kwds)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/eager/def_function.py", line 503, in _call
self._initialize(args, kwds, add_initializers_to=initializer_map)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/eager/def_function.py", line 408, in _initialize
*args, **kwds))
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/eager/function.py", line 1848, in _get_concrete_function_internal_garbage_collected
graph_function, _, _ = self._maybe_define_function(args, kwargs)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/eager/function.py", line 2150, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/eager/function.py", line 2041, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/eager/def_function.py", line 358, in wrapped_fn
return weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 73, in distributed_function
per_replica_function, args=(model, x, y, sample_weights))
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 760, in experimental_run_v2
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1787, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 2132, in _call_for_each_replica
return fn(*args, **kwargs)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper
return func(*args, **kwargs)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 264, in train_on_batch
output_loss_metrics=model._output_loss_metrics)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 311, in train_on_batch
output_loss_metrics=output_loss_metrics))
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 272, in _process_single_batch
model.optimizer.apply_gradients(zip(grads, trainable_weights))
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_addons/optimizers/weight_decay_optimizers.py", line 153, in apply_gradients
grads_and_vars, name=name)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 441, in apply_gradients
kwargs={"name": name})
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1917, in merge_call
return self._merge_call(merge_fn, args, kwargs)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1924, in _merge_call
return merge_fn(self._strategy, *args, **kwargs)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 485, in _distributed_apply
var, apply_grad_to_update_var, args=(grad,), group=False))
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1530, in update
return self._update(var, fn, args, kwargs, group)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 2142, in _update
return self._update_non_slot(var, fn, (var,) + tuple(args), kwargs, group)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 2148, in _update_non_slot
result = fn(*args, **kwargs)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 467, in apply_grad_to_update_var
update_op = self._resource_apply_dense(grad, var, **apply_kwargs)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_addons/optimizers/weight_decay_optimizers.py", line 173, in _resource_apply_dense
with tf.control_dependencies([self._decay_weights_op(var)]):
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_addons/optimizers/weight_decay_optimizers.py", line 158, in _decay_weights_op
self._get_hyper('weight_decay', var.dtype) * var,
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/ops/variables.py", line 1079, in _run_op
return tensor_oper(a.value(), *args, **kwargs)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/ops/math_ops.py", line 924, in r_binary_op_wrapper
x = ops.convert_to_tensor(x, dtype=y.dtype.base_dtype, name="x")
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 1184, in convert_to_tensor
return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 1242, in convert_to_tensor_v2
as_ref=False)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 1296, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/constant_op.py", line 286, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/constant_op.py", line 227, in constant
allow_broadcast=True)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/constant_op.py", line 265, in _constant_impl
allow_broadcast=allow_broadcast))
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/tensor_util.py", line 449, in make_tensor_proto
_AssertCompatible(values, dtype)
File "/home/yetao/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/tensor_util.py", line 331, in _AssertCompatible
(dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected float32, got <tensorflow.python.keras.optimizer_v2.learning_rate_schedule.PiecewiseConstantDecay object at 0x7f57d72eef60> of type 'PiecewiseConstantDecay' instead.
It says weight decay needs to be a float32 rather than a PiecewiseConstantDecay object, but then why can the learning rate be one?
And I saw someplace implementing the weight decay with learning rate schedule by $wd_t = wd* lr_t / lr$, this seems like a good way to implement it, but I'm not familiar with the implementation of TF2.0.
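As a rough illustration of the $wd_t = wd \cdot lr_t / lr$ idea, here is a minimal plain-Python sketch. The function name `scaled_weight_decay` and the initial values 1e-3 and 1e-4 are assumptions chosen for the example; this is not how any TF optimizer implements it.

```python
def scaled_weight_decay(lr_t, lr_init, wd_init):
    """Weight decay that follows the learning rate: wd_t = wd_init * lr_t / lr_init."""
    if lr_init == 0:
        # The ratio is undefined if training starts from a zero learning rate.
        raise ValueError("lr_init must be non-zero")
    return wd_init * lr_t / lr_init


lr_init, wd_init = 1e-3, 1e-4  # hypothetical initial values

for lr_t in (1e-3, 1e-4, 1e-5):
    print(lr_t, scaled_weight_decay(lr_t, lr_init, wd_init))
```

The weight decay then drops by the same factor as the learning rate at every boundary, so only the learning rate needs an explicit schedule.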
Thanks for trying! I hope to find some time on the weekend to take a closer look.
I've avoided the model.fit function so far as I feel it does too much under the hood but I guess now's the time to dive in ;)
Inspired by https://github.com/sajadn/AdamW/blob/master/DecoupleWeightDecay.py, I found a way to use a callback that sets the weight decay along with the learning rate schedule at the beginning of each epoch. The code below implements AdamW with a learning rate schedule that changes per epoch (not per update):
import tensorflow as tf
import os
from tensorflow_addons.optimizers import AdamW
import numpy as np
from tensorflow.python.keras import backend as K
from tensorflow.python.util.tf_export import keras_export
from tensorflow.keras.callbacks import Callback

def lr_schedule(epoch):
    """Learning Rate Schedule

    Learning rate is scheduled to be reduced after 20, 30 epochs.
    Called automatically every epoch as part of callbacks during training.

    # Arguments
        epoch (int): The number of epochs

    # Returns
        lr (float32): learning rate
    """
    lr = 1e-3
    if epoch >= 30:
        lr *= 1e-2
    elif epoch >= 20:
        lr *= 1e-1
    print('Learning rate: ', lr)
    return lr

def wd_schedule(epoch):
    """Weight Decay Schedule

    Weight decay is scheduled to be reduced after 20, 30 epochs.
    Called automatically every epoch as part of callbacks during training.

    # Arguments
        epoch (int): The number of epochs

    # Returns
        wd (float32): weight decay
    """
    wd = 1e-4
    if epoch >= 30:
        wd *= 1e-2
    elif epoch >= 20:
        wd *= 1e-1
    print('Weight decay: ', wd)
    return wd

# Copied from the implementation of LearningRateScheduler, with lr replaced by weight_decay
@keras_export('keras.callbacks.WeightDecayScheduler')
class WeightDecayScheduler(Callback):
    """Weight Decay Scheduler.

    Arguments:
        schedule: a function that takes an epoch index as input
            (integer, indexed from 0) and returns a new
            weight decay as output (float).
        verbose: int. 0: quiet, 1: update messages.

    Example:
        # This function keeps the weight decay at 0.001 for the first ten epochs
        # and decreases it exponentially after that.
        def scheduler(epoch):
            if epoch < 10:
                return 0.001
            else:
                return float(0.001 * tf.math.exp(0.1 * (10 - epoch)))

        callback = WeightDecayScheduler(scheduler)
        model.fit(data, labels, epochs=100, callbacks=[callback],
                  validation_data=(val_data, val_labels))
    """

    def __init__(self, schedule, verbose=0):
        super(WeightDecayScheduler, self).__init__()
        self.schedule = schedule
        self.verbose = verbose

    def on_epoch_begin(self, epoch, logs=None):
        if not hasattr(self.model.optimizer, 'weight_decay'):
            raise ValueError('Optimizer must have a "weight_decay" attribute.')
        try:  # new API: schedule(epoch, current_weight_decay)
            weight_decay = float(K.get_value(self.model.optimizer.weight_decay))
            weight_decay = self.schedule(epoch, weight_decay)
        except TypeError:  # Support for old API for backward compatibility
            weight_decay = self.schedule(epoch)
        if not isinstance(weight_decay, (float, np.float32, np.float64)):
            raise ValueError('The output of the "schedule" function '
                             'should be float.')
        # Write the new value into the optimizer's weight_decay variable.
        K.set_value(self.model.optimizer.weight_decay, weight_decay)
        if self.verbose > 0:
            print('\nEpoch %05d: WeightDecayScheduler reducing weight '
                  'decay to %s.' % (epoch + 1, weight_decay))

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # Expose the current weight decay so TensorBoard and History can log it.
        logs['weight_decay'] = K.get_value(self.model.optimizer.weight_decay)

if __name__ == '__main__':
    os.environ["CUDA_VISIBLE_DEVICES"] = '1'
    gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, enable=True)
    print(gpus)

    cifar10 = tf.keras.datasets.cifar10
    (x_train, y_train), (x_test, y_test) = cifar10.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(16, (3, 3), padding='same', activation='relu', input_shape=(32, 32, 3)),
        tf.keras.layers.AveragePooling2D(),
        tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
        tf.keras.layers.AveragePooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

    # Start from the epoch-0 values; the callbacks update both every epoch.
    optimizer = AdamW(learning_rate=lr_schedule(0), weight_decay=wd_schedule(0))
    # optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

    tb_callback = tf.keras.callbacks.TensorBoard(os.path.join('logs', 'adamw'),
                                                 profile_batch=0)
    lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)
    wd_callback = WeightDecayScheduler(wd_schedule)

    model.compile(optimizer=optimizer,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=40, validation_split=0.1,
              callbacks=[tb_callback, lr_callback, wd_callback])
    model.evaluate(x_test, y_test, verbose=2)
This can serve as a simple example of using AdamW with tf.keras.
But if someone wants to decay the learning rate on every weight update, as tf.optimizers.schedules.PiecewiseConstantDecay does, that cannot be achieved with the code above.
@PhilJd Thanks!
I would think we have to do something like this to weight_decay if we want to pass an instance of LearningRateSchedule into it.
@WindQAQ I agree with you. If we do not establish a connection between weight_decay and learning_rate through initial values such as $wd_t = wd_{init} \cdot lr_t / lr_{init}$, then we must do the same schedule for both, but the current code does not support the case where weight_decay is a learning_rate_schedule.LearningRateSchedule. I also think that weight_decay needs to support the type learning_rate_schedule.LearningRateSchedule.
BTW, the callback method mentioned above can be used normally.
Agree +1. As there is another request in #865, @PhilJd do you think we should support decaying the weight_decay param? Thank you.
@wuliytTaotao @WindQAQ Hi, with mixed-precision training, the code above with the WD scheduler doesn't work:
in on_epoch_begin(self, epoch, logs)
     13     def on_epoch_begin(self, epoch, logs=None):
     14         if not hasattr(self.model.optimizer, 'weight_decay'):
---> 15             raise ValueError('Optimizer must have a "weight_decay" attribute.')
     16         try:  # new API
     17             weight_decay = float(K.get_value(self.model.optimizer.weight_decay))
ValueError: Optimizer must have a "weight_decay" attribute.
This is because the optimizer gets wrapped in a LossScaleOptimizer:
optimizer.weight_decay
<tf.Variable 'weight_decay:0' shape=() dtype=float64, numpy=0.0002>
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
model.optimizer
<tensorflow.python.keras.mixed_precision.experimental.loss_scale_optimizer.LossScaleOptimizer at 0x29619c8a2c8>
By the way: updating WD and LR per step for Adam is unnecessary, because Adam can adjust the LR automatically within an epoch, and WD is aimed at decoupling the weight decay regularization from the loss function and the LR (see the original paper, "Decoupled Weight Decay Regularization"). All in all, an epoch-level update is more than sufficient.
@AlexWang1900 Please list your code completely and give your TF version.
@wuliytTaotao
tf.keras.mixed_precision.experimental.set_policy('mixed_float16')
optimizer = tfa.optimizers.AdamW(learning_rate=lr_schedule(0), weight_decay=wd_schedule(0),amsgrad=False)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
model.optimizer
<tensorflow.python.keras.mixed_precision.experimental.loss_scale_optimizer.LossScaleOptimizer at 0x29619c8a2c8>
Has anything moved regarding the bug with GPU?
Hey guys,
While facing a similar issue, and until there is a PR, here is a workaround I found that works either with .fit() (classic keras usage) or with a custom use of the AdamW object.
I first create the AdamW object as opt, then assign a lambda function returning the value of wd_schedule(opt.iterations) as its weight_decay attribute. This updates the weight decay value in step with the optimizer's number of iterations.
Here is a code snippet for a training scheme using .fit():
lr_schedule = tf.optimizers.schedules.ExponentialDecay(1e-4, 100, 0.9)
wd_schedule = tf.optimizers.schedules.ExponentialDecay(5e-5, 100, 0.9)
opt = AdamW(learning_rate=lr_schedule, weight_decay=lambda : None)
opt.weight_decay = lambda : wd_schedule(opt.iterations)
mlp.compile(
optimizer=opt,
loss=tf.keras.losses.BinaryCrossentropy())
If I create a tf.keras.callbacks.Callback to check that the weight decay value does change:
class DecayHistory(tf.keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        self.lr = []
        self.wd = []

    def on_batch_end(self, batch, logs=None):
        # Record the scheduled learning rate and the resolved weight decay at this step.
        self.lr.append(self.model.optimizer.lr(self.model.optimizer.iterations))
        self.wd.append(self.model.optimizer.weight_decay)
I obtain the expected behavior as shown in the following plot :
PS: @wuliytTaotao's solution can be updated at each step by using on_batch_end() instead of on_epoch_end().
lr_schedule = tf.optimizers.schedules.ExponentialDecay(1e-4, 100, 0.9) I obtain the expected behavior as shown in the following plot :
You specify 100 decay steps in the code, but in the plot decay continues for the entire plot range (more than 2000 steps). Can you clarify this discrepancy or update the code, please?
Hi,
This is inherent to the way tf.optimizers.schedules.ExponentialDecay is built.
Indeed, if you look at the documentation (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/ExponentialDecay), how decay_steps works is not very clear.
What it does is this: decay_steps is the number of steps it takes to get from a learning rate lr to a learning rate of decay_rate * lr.
As a concrete example, let's take the parameters of the learning rate scheduler above, with initial_learning_rate = 1e-4, decay_steps = 100, decay_rate = 0.9:
learning_rate_step_100 = 0.9 * initial_learning_rate
learning_rate_step_200 = 0.9 * learning_rate_step_100
And so on... Contrary to some other schedulers (such as the cosine scheduler), exponential decay never ends. The general formula if staircase=False is:
lr(step) = (decay_rate ** (step / decay_steps) )* initial_learning_rate
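To make the formula concrete, here is a small plain-Python sketch of it. The function `exponential_decay` is a hypothetical stand-in written for this example, not the TensorFlow class; the staircase branch assumes the documented behaviour of flooring step / decay_steps.

```python
import math


def exponential_decay(step, initial_learning_rate, decay_steps, decay_rate,
                      staircase=False):
    """lr(step) = initial_learning_rate * decay_rate ** (step / decay_steps)."""
    exponent = step / decay_steps
    if staircase:
        # With staircase=True the rate only changes every decay_steps steps.
        exponent = math.floor(exponent)
    return initial_learning_rate * decay_rate ** exponent


# Parameters from the snippet discussed above.
for step in (0, 100, 200, 2000):
    print(step, exponential_decay(step, 1e-4, 100, 0.9))
```

Step 100 gives 0.9 times the initial rate, step 200 gives 0.81 times, and the value keeps shrinking at step 2000 and beyond, which matches a plot that decays over its whole range.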
Many thanks!
One can also follow https://github.com/tensorflow/addons/pull/1974 to make AdamW support a scheduler. Feel free to open a PR and request my review if anyone is interested in it. Thanks.
@hugoych While your selected solution works for the schedule, it no longer allows the optimizer to be serialized, due to this line:
With your solution, the parameter is a callable, but it returns a tensor. Following the function self._serialize_hyperparameter:
A callable is resolved differently from a tensor or a schedule object. The fix is to reverse the order of operations: first resolve the callable, then check whether the result is a tensor or a custom object (in this case, a learning rate scheduler).
Until proper schedules are implemented, this solution can be used in conjunction with yours (this is for SGDW, but the same can be done for AdamW):
import tensorflow as tf
import tensorflow_addons as tfa


class SerializableSGDW(tfa.optimizers.SGDW):
    def get_config(self):
        config = tf.keras.optimizers.SGD.get_config(self)
        config.update(
            {"weight_decay": self._fixed_serialize_hyperparameter("weight_decay")}
        )
        return config

    def _fixed_serialize_hyperparameter(self, hyperparameter_name):
        """Serialize a hyperparameter that can be a float, callable, or Tensor."""
        value = self._hyper[hyperparameter_name]
        schedule_cls = tf.keras.optimizers.schedules.LearningRateSchedule
        # First resolve a plain callable. A schedule object is callable too,
        # but it expects a step argument, so it is left for the next check.
        if callable(value) and not isinstance(value, schedule_cls):
            value = value()
        if isinstance(value, schedule_cls):
            return tf.keras.optimizers.schedules.serialize(value)
        if tf.is_tensor(value):
            return tf.keras.backend.get_value(value)
        return value
Note, however, that after loading the model, weight_decay will be a variable and no longer the scheduler.
@MHStadler Can we use weight_decay scheduling now? How can I check that it's working? The reason I want to check is that AdamW isn't working properly for me. It takes about twice as long as SGD and the result is very poor. I'm using tfa-nightly with the default Colab environment. Thank you. (Should I post this as a new issue with more details?)
Rather than have to schedule weight decay manually, why don't we just multiply weight decay by the learning rate? This would scale weight decay automatically. This is what they do in PyTorch and in the Fast.ai library.
Here's how TFA currently implements weight decay:
def _decay_weights_op(self, var, apply_state=None):
    ...  # irrelevant code elided
    return var.assign_sub(coefficients["wd_t"] * var, self._use_locking)  # Line 182
    ...
I am proposing instead doing:
return var.assign_sub(coefficients["lr"] * coefficients["wd_t"] * var, self._use_locking) # Scales wd by learning rate
Then there would be no need to manually schedule weight decay to match the learning rate schedule.
This is what they do in AdamW for PyTorch:
...
param.mul_(1 - lr * weight_decay)
and also in Fast.ai:
if do_wd and wd!=0: p.data.mul_(1 - lr*wd)
This would simplify things greatly.
If you read the original AdamW paper, a large part of the motivation behind this weight decay was to decouple the applied weight decay from the learning rate. Scaling the decay by the learning rate is a common error that defeats the purpose of the originally proposed algorithm.
@MHStadler You are wrong. The motivation in AdamW was to decouple the wd from gradients. Before, the weights with larger gradients were decayed less (due to larger denominator) which is an undesired behaviour. After the fix all weights are decayed equally.
But another problem arises. As you decrease your LR, the regularisation strength of WD increases, because the model needs gradients that are orders of magnitude larger just to keep the weights the same and prevent their decay. This is again undesired behaviour. The first image in this thread shows an example: when the LR is decreased, weight decay starts to decay already trained weights, and this leads to a decrease in accuracy. Coupling weight decay with the learning rate would remove this problem: the slower the weights change (when lr is low), the smaller the weight decay. So what @bwolfson97 proposed is not "a common error", but a better implementation.
I quote from an email I got from the authors of the paper:
"i) Please verify that AdamW that you are using is consistent with the algorithm given in the paper. Some implementations only decouple the adaptive gradient update from the weight decay update but still keep the learning rate and weight decay coupled, e.g., somewhere in the code lr (scheduled or not) is multiplied by w"
Hey, just to give some context: As @MHStadler mentioned, weight decay is intentionally decoupled from the learning rate.
In the paper, both, weight decay and learning rate, are multiplied by a/the schedule. So multiplying by the learning rate would be just a proxy for that with the side effect that changing the learning rate (e.g. during hyper parameter search) also changes the regularization strength. One could prevent that and recover the scheduling values by storing the initial learning rate and dividing by it. However, if you'd then like to decouple wd and lr for some experiment and use different schedules, you need to actually undo the scheduling and apply your own schedule, which I'd rather avoid.
In the end, you'd have the same problem with the scheduling logic of Keras being too specialized towards learning rates. (Pytorch has the same problem, with the scheduling being targeted towards lr only. At some point there was some discussion to fix this but I'm not sure what's the status there).
As a compromise, because I realize that with the current Keras situation this is a bit of a pain and it's often reasonable to use the same schedule, I could imagine adding this behaviour guarded behind a multiply_weight_decay_by_lr=False flag. WDYT?
@PhilJd I think that sounds like a reasonable compromise, though I still think that this decoupling is vital to get the best out of AdamW (as shown by section 4.2 of the original paper)
While you are right that in theory one could use different schedules for LR and weight decay, in practice it almost never happens and the schedule is the same, so it's convenient to implement the same scheduling by multiplying weight decay by LR. In my head they are still decoupled, but the user has to be aware of this behaviour and do the math themselves. It is possible to remember the initial lr and then use it to determine wd as wd_t = wd_0 * lr_t / lr_0, but this could lead to some unexpected bugs (for example, I often initialise lr with 0 and then rely on LR schedulers to set a proper value, but I mostly use PyTorch rather than TF).
The compromise with multiply_weight_decay_by_lr proposed by @PhilJd works well in my opinion. If a user is aware enough to enable this option, they probably also understand the relationship between LR and WD and can do the math.
The point of the separation is not only to potentially have different schedules, but to create "a more separable hyperparameter space". This is what ultimately allowed hyperparameter combinations that enabled AdamW to outperform SGD with momentum.
If you check Figure 2 in the paper, you can see the difference between the coupled and decoupled version for the same hyperparameters. Multiplying the WD by the LR will always fluctuate the effectively applied WD, unless we apply the scaling by the original LR as pointed out by @PhilJd.
However, if I understood his suggestion correctly then the flag would only enable/disable WD * LR, but not include the scaling one way or another, but maybe I'm wrong. If I'm correct, then the flag would just allow you to disregard the author's advice about the decoupling, with the drawback of potentially worse performance in exchange for an easier time with the scheduling.
@PhilJd If the intention is to include the scaling by LR factor if the flag is set to true (effectively decoupling the LR from the WD, but allowing the same schedule), then I think you should consider having the default set to True instead of False, and maybe name the parameter something like multiply_weight_decay_by_lr_ratio.
Multiplying the WD by the LR will always fluctuate the effectively applied WD
This is what I call "the user can do the math". If you want a real weight decay of 1e-5 and train with lr=1e-3, then the value of wd passed to the optimizer should be 1e-2. While this requires some additional thinking when choosing hyperparameters, lr and wd are not coupled in the same sense in which they are coupled in Figure 2 of the original paper.
My understanding was that setting the proposed flag to True would enable the same behaviour as currently in PyTorch:
p *= 1 - group['lr'] * group['weight_decay']
and setting it to False will keep the current behaviour
p *= 1 - group['weight_decay']
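The arithmetic behind the two update rules quoted above can be checked with a few lines of plain Python. This is only an illustrative sketch: `decay_factor` is a hypothetical helper, and the numbers are the ones from the example in this comment (a target per-step decay of 1e-5 at lr=1e-3).

```python
def decay_factor(weight_decay, lr=None):
    """Per-step multiplier applied to a weight p.

    With lr given:    p *= 1 - lr * weight_decay   (decay scaled by lr)
    With lr omitted:  p *= 1 - weight_decay        (decay independent of lr)
    """
    if lr is None:
        return 1.0 - weight_decay
    return 1.0 - lr * weight_decay


lr = 1e-3
target_decay = 1e-5  # the "real" per-step weight decay we want

# When the decay is multiplied by lr, pass target_decay / lr instead.
wd_when_scaled_by_lr = target_decay / lr

print(wd_when_scaled_by_lr)
print(decay_factor(wd_when_scaled_by_lr, lr=lr))
print(decay_factor(target_decay))
```

Both rules then shrink the weight by the same factor at this learning rate. They diverge as soon as lr changes, which is the behaviour the flag would switch on or off.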
AdamW is still not serializable; I can't save my model with model.save().
System information
Describe the bug
There seem to be no examples showing how to use AdamW with a learning rate scheduler, so I tried to use AdamW as in the code below. The code works with Adam, but with AdamW and learning rate decay it doesn't.
Can anyone give a correct example of using AdamW with learning rate decay?
Code to reproduce the issue
Other info / logs