@orhanf @sebastien-j can you review and merge it?
Well, the thing is, `NaN ⊡ x` is always false for any comparison operator `⊡`.
In groundhog, `tmpg = TT.switch(TT.ge(norm_gs, c), g*c/norm_gs, g)` means that if there is a single NaN, `norm_gs` is NaN, `TT.ge(norm_gs, c)` is false, and no clipping is performed. The NaNs are removed afterward.
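A minimal NumPy sketch of those comparison semantics (NumPy's `where` standing in for Theano's `switch`; the values are made up):

```python
import numpy as np

# IEEE 754: any comparison involving NaN is false, so the clipping
# condition fails and switch falls through to the unclipped gradient.
norm_gs, c, g = np.nan, 1.0, 2.0
print(norm_gs >= c)                                 # False
print(np.where(norm_gs >= c, g * c / norm_gs, g))   # 2.0, the raw gradient
```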
In blocks, if we put `StepClipping` first:

```python
norm = l2_norm(previous_steps.values())
multiplier = tensor.switch(norm < self.threshold,
                           1, self.threshold / norm)
steps = OrderedDict(
    (param, step * multiplier)
    for param, step in previous_steps.items())
```
Everything is divided by `norm` as soon as `norm` becomes NaN (the condition `norm < self.threshold` is then false), so we get NaN everywhere, and these are then all replaced by `0.1 * parameter` by `RemoveNotFinite`.
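Again a small NumPy sketch (made-up values) of how the multiplier itself turns into NaN:

```python
import numpy as np

norm, threshold = np.nan, 1.0
# norm < threshold is False for NaN, so the "clip" branch is selected
# and the multiplier is threshold / NaN == NaN: every step becomes NaN.
multiplier = np.where(norm < threshold, 1.0, threshold / norm)
print(multiplier)  # nan
```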
If we want to match groundhog exactly, I can write a custom `StepRule`, since the current PR will clip `0.1 * parameter` (which is not done by groundhog).
I think if we want to match the GroundHog implementation exactly (which I guess is the safe choice), we should write a custom `StepRule` as you suggested. Combining the two into a single step rule and fixing the NaN issue results in the following `StepRule`, which I am using right now:
```python
from collections import OrderedDict

from theano import tensor

from blocks.algorithms import StepRule
from blocks.theano_expressions import l2_norm
from blocks.utils import shared_floatx


class StepClippingWithRemoveNotFinite(StepRule):
    def __init__(self, threshold=None, scale=0.1):
        if threshold:
            self.threshold = shared_floatx(threshold)
        self.scale = scale

    def compute_steps(self, previous_steps):
        if not hasattr(self, 'threshold'):
            return previous_steps, []
        norm = l2_norm(previous_steps.values())
        # Clip as usual when the norm is finite and above the threshold.
        multiplier = tensor.switch(tensor.ge(norm, self.threshold),
                                   self.threshold / norm, 1)
        # If the norm is NaN/Inf, fall back to a scaled step instead.
        notfinite = tensor.or_(tensor.isnan(norm), tensor.isinf(norm))
        steps = OrderedDict(
            (param, tensor.switch(
                notfinite, step * self.scale, step * multiplier))
            for param, step in previous_steps.items())
        return steps, []
```
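For context, a hypothetical way to wire it up (`cost` and `params` are placeholders, the threshold value is made up, and the `GradientDescent` keyword names may differ between Blocks versions):

```python
from blocks.algorithms import GradientDescent

# Hypothetical wiring; not part of the PR itself.
algorithm = GradientDescent(
    cost=cost, params=params,
    step_rule=StepClippingWithRemoveNotFinite(threshold=5.0, scale=0.1))
```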
Any flaws you see?
```python
(param, tensor.switch(notfinite, step * self.scale, step * multiplier))
```

Should be:

```python
(param, tensor.switch(notfinite, param * self.scale, step * multiplier))
```
Otherwise, I think `ifelse` is preferred to `switch`; in practice it would only allow bypassing the `threshold / norm` computation, though, so yeah, not important.
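A quick sketch of the difference (illustration only): `switch` is elementwise and computes both branches, while Theano's lazy `ifelse` evaluates only the branch that is taken.

```python
from theano import tensor
from theano.ifelse import ifelse

norm = tensor.scalar('norm')
threshold = tensor.scalar('threshold')

# Eager: threshold / norm is computed even when no clipping is needed.
eager = tensor.switch(norm < threshold, 1.0, threshold / norm)
# Lazy: the division is only evaluated when the condition is false.
lazy = ifelse(norm < threshold, tensor.ones_like(norm), threshold / norm)
```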
Nice catch, thanks. Can you train a model with this step rule?
We've observed some unexpected behaviors when `ifelse` is used with the GPU; I would suggest sticking to the version with `switch`.
On Thu, May 21, 2015 at 03:05:13PM -0700, Orhan Firat wrote:
> Nice catch, thanks. Can you train a model with this step rule?

It's running on bart13 (bokeh on :5006). Logs at /data/lisatmp3/simonet/wmt15/models/debug6

> We've observed some unexpected behaviors when `ifelse` is used with the GPU; I would suggest sticking to the version with `switch`.

Ok, I used `switch`.
-- Étienne
Hold on guys, I don't understand why `StepClipping` followed by `RemoveNotFinite` is bad for you. The only thing you should want is that if the gradient contains any NaNs, no update is done. Use `scaler=0.0` for `RemoveNotFinite` and that is exactly what you will get, isn't it?
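For illustration, a sketch of that composition (the threshold value is made up; the parameter names follow the Blocks step rules):

```python
from blocks.algorithms import CompositeRule, RemoveNotFinite, StepClipping

# Clip first; any NaN/Inf step that survives (or is produced) is then
# replaced by 0.0 * parameter, i.e. no update for that parameter.
step_rule = CompositeRule([StepClipping(threshold=5.0),
                           RemoveNotFinite(scaler=0.0)])
```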
I understand your desire to copy all quirks from Groundhog for the sake of safety, but some of them can definitely be (a) implemented differently or (b) discarded.
I'll also use this branch to update the code to the latest version of blocks/fuel.