kyunghyuncho / NMT


Remove NaN from the gradient. (fix #22) #24

Closed ejls closed 9 years ago

ejls commented 9 years ago

I'll also use this branch to update the code to the last version of blocks/fuel.

kyunghyuncho commented 9 years ago

@orhanf @sebastien-j can you review and merge it?

ejls commented 9 years ago

Well, the thing is, NaN >= x is always false for all x. In groundhog, tmpg = TT.switch(TT.ge(norm_gs, c), g*c/norm_gs, g) means that if there is a single NaN, norm_gs is NaN and no clipping is performed. The NaNs are removed afterwards.
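A toy NumPy sketch of those semantics (made-up numbers, NumPy standing in for the Theano ops):

    import numpy as np

    c = 1.0                              # clipping threshold
    g = np.array([0.5, np.nan])          # a gradient with a single NaN
    norm_gs = np.sqrt(np.sum(g ** 2))    # nan: one NaN poisons the norm

    # TT.ge(norm_gs, c) with a NaN norm evaluates to False, so the switch
    # keeps the raw gradient and no clipping happens
    tmpg = g * c / norm_gs if norm_gs >= c else g   # -> g, untouched

    # the NaNs are only removed in a later step, e.g.
    tmpg = np.where(np.isnan(tmpg), 0.0, tmpg)      # -> [0.5, 0.0]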

In blocks, if we put StepClipping first:

        norm = l2_norm(previous_steps.values())
        multiplier = tensor.switch(norm < self.threshold,
                                   1, self.threshold / norm)
        steps = OrderedDict(
            (param, step * multiplier)
            for param, step in previous_steps.items())

Everything is divided by norm, so as soon as norm is NaN we get NaN everywhere (and every step is then replaced by 0.1 * parameter).
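With toy numbers (NumPy standing in for the Theano graph; the 0.1 factor is the RemoveNotFinite scaler mentioned above):

    import numpy as np

    threshold = 1.0
    steps = {'W': np.array([0.5, np.nan]), 'b': np.array([0.2, 0.3])}
    params = {'W': np.array([1.0, 2.0]), 'b': np.array([3.0, 4.0])}

    # l2_norm over all steps: the single NaN poisons the shared norm
    norm = np.sqrt(sum(np.sum(s ** 2) for s in steps.values()))    # nan

    # StepClipping: nan < threshold is False, so every step is multiplied
    # by threshold / norm, which is NaN
    multiplier = 1.0 if norm < threshold else threshold / norm     # nan
    steps = {p: s * multiplier for p, s in steps.items()}          # all NaN now

    # RemoveNotFinite then replaces each poisoned step by 0.1 * parameter,
    # including the step for 'b' that was perfectly fine to begin with
    steps = {p: (0.1 * params[p] if not np.all(np.isfinite(s)) else s)
             for p, s in steps.items()}
    print(steps['b'])    # [0.3 0.4] instead of the clean [0.2 0.3]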

If we want to match groundhog exactly, I can write a custom StepRule, since with the current PR the 0.1 * parameter replacement steps get clipped as well (which groundhog does not do).

orhanf commented 9 years ago

I think if we want to match the GroundHog implementation exactly (which I guess is the safe choice), we should write a custom StepRule as you suggested. Combining both in a single step rule and fixing the NaN issue gives the following StepRule, which I am using right now:

from collections import OrderedDict

from theano import tensor

# (import paths below are my guess at the usual blocks locations)
from blocks.algorithms import StepRule
from blocks.theano_expressions import l2_norm
from blocks.utils import shared_floatx


class StepClippingWithRemoveNotFinite(StepRule):

    def __init__(self, threshold=None, scale=0.1):
        if threshold:
            self.threshold = shared_floatx(threshold)
        self.scale = scale

    def compute_steps(self, previous_steps):
        if not hasattr(self, 'threshold'):
            return previous_steps
        norm = l2_norm(previous_steps.values())
        multiplier = tensor.switch(tensor.ge(norm, self.threshold),
                                   self.threshold / norm, 1)
        # if the norm is NaN or inf, fall back to scaling instead of clipping
        notfinite = tensor.or_(tensor.isnan(norm), tensor.isinf(norm))
        steps = OrderedDict(
            (param, tensor.switch(
                notfinite, step * self.scale, step * multiplier))
            for param, step in previous_steps.items())
        return steps, []
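For reference, this is roughly how I plug it into training (a sketch; cost and params are placeholders for the model's cost variable and parameter list, and depending on the blocks version the keyword for the parameter list is params or parameters):

    from blocks.algorithms import CompositeRule, GradientDescent, Scale

    # clip / de-NaN first, then apply the learning rate
    step_rule = CompositeRule([StepClippingWithRemoveNotFinite(threshold=1.0),
                               Scale(learning_rate=0.01)])
    algorithm = GradientDescent(cost=cost, params=params, step_rule=step_rule)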

Do you see any flaws?

ejls commented 9 years ago
    (param, tensor.switch(notfinite, step * self.scale, step * multiplier))

Should be:

    (param, tensor.switch(notfinite, param * self.scale, step * multiplier))

Otherwise, I think ifelse would be preferable to switch; in practice it would only allow bypassing the threshold / norm computation, so it is not important.
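For what it's worth, a toy sketch of the difference (not the actual step rule): switch keeps both branches in the graph, while ifelse evaluates only the selected one.

    import numpy
    import theano
    from theano import tensor
    from theano.ifelse import ifelse

    floatX = theano.config.floatX
    norm = tensor.scalar('norm')
    threshold = tensor.constant(numpy.asarray(1.0, dtype=floatX))
    one = tensor.constant(numpy.asarray(1.0, dtype=floatX))

    # switch is elementwise and keeps both branches in the graph, so
    # threshold / norm is computed even when norm < threshold
    m_switch = tensor.switch(norm < threshold, one, threshold / norm)

    # ifelse is lazy: only the selected branch is evaluated, so the
    # division is skipped entirely when norm < threshold
    m_ifelse = ifelse(norm < threshold, one, threshold / norm)

    f = theano.function([norm], [m_switch, m_ifelse],
                        allow_input_downcast=True)
    print(f(0.5))   # [1.0, 1.0]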

orhanf commented 9 years ago

Nice catch, thanks. Can you train a model with this step rule? We've observed some unexpected behaviors when ifelse is used with the GPU, so I would suggest sticking to the version with switch.

ejls commented 9 years ago

On Thu, May 21, 2015 at 03:05:13PM -0700, Orhan Firat wrote:

> Nice catch, thanks. Can you train a model with this step rule?

It's running on bart13 (bokeh on :5006). Logs are at /data/lisatmp3/simonet/wmt15/models/debug6

> We've observed some unexpected behaviors when ifelse is used with the GPU, so I would suggest sticking to the version with switch.

Ok, I used switch.

-- Étienne

rizar commented 9 years ago

Hold on guys, I don't understand why StepClipping followed by RemoveNotFinite is bad for you.

The only thing you should want is that if the gradient contains any NaNs, no update is done. Use scaler=0.0 for RemoveNotFinite and that is exactly what you get, isn't it?

I understand your desire to copy all of Groundhog's quirks for the sake of safety, but some of them can definitely be (a) implemented differently or (b) discarded.
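Concretely, the combination I mean is just (a sketch with a made-up threshold):

    from blocks.algorithms import CompositeRule, RemoveNotFinite, StepClipping

    # StepClipping first; if a NaN gradient poisons the norm, all steps
    # become NaN, and RemoveNotFinite with scaler=0.0 then replaces them
    # with 0.0 * parameter, i.e. the update is simply skipped
    step_rule = CompositeRule([StepClipping(threshold=1.0),
                               RemoveNotFinite(scaler=0.0)])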