kyunghyuncho / NMT


Some clean up and speed up #27

Closed ejls closed 9 years ago

ejls commented 9 years ago

The --subtensor-fix parameter gives an 18-20% speed-up in training.
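For reference, a minimal sketch of the idea behind the fix (this is not the PR's GradientDescent_SubtensorFix; the sizes, the toy cost and the plain-SGD update are made up for illustration): take the gradient only with respect to the rows of the lookup table that the batch actually uses, and update those rows with inc_subtensor instead of materialising a dense vocab_size x dim gradient.

import numpy
import theano
import theano.tensor as T

vocab_size, dim, lr = 30000, 620, 0.1                   # hypothetical sizes
W = theano.shared(numpy.zeros((vocab_size, dim), dtype='float32'), name='lookup')

indices = T.ivector('indices')                          # word ids in the batch
W_sub = W[indices]                                      # only the rows the batch touches
cost = ((W_sub - 1.0) ** 2).sum()                       # stand-in for the real NMT cost
grad_sub = theano.grad(cost, wrt=W_sub)                 # gradient of the used rows only
new_W = T.inc_subtensor(W_sub, -lr * grad_sub)          # sparse update of those rows
train = theano.function([indices], cost, updates=[(W, new_W)])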

To check the correctness of the code I ran both algorithms in parallel:

import numpy
from blocks.algorithms import GradientDescent   # assumed import path for the base class

class GradientDescent_Test(GradientDescent):
    """Run the reference GradientDescent and a modified algorithm on the same
    batches and compare the parameter values they produce."""
    def __init__(self, params, modified, *args, **kwargs):
        self.params = params
        self.modified = modified
        GradientDescent.__init__(self, params=params, *args, **kwargs)

    def initialize(self):
        GradientDescent.initialize(self)
        self.modified.initialize()

    def process_batch(self, batch):
        # Save the parameters, run the reference algorithm and record its
        # result, then restore the parameters and run the modified algorithm
        # on the same batch.
        before = [param.get_value() for param in self.params]
        GradientDescent.process_batch(self, batch)
        originals = [param.get_value() for param in self.params]
        for param, val in zip(self.params, before):
            param.set_value(val)
        self.modified.process_batch(batch)
        # Report every parameter whose update differs between the two algorithms.
        for original, modified in zip(originals, self.params):
            m = numpy.abs(original - modified.get_value()).max()
            if m > 0:
                try:
                    print '{0:<35}{1}'.format('parent brick', modified.tag.annotations[0].parents)
                    print '{0:<35}{1}'.format('maximum difference', m)
                    print '{0:<35}{1}'.format('maximum relative difference', (numpy.abs(original - modified.get_value()) / before[self.params.index(modified)]).max())
                    print '{0:<35}{1}'.format('differences indices', numpy.where((original != modified.get_value()).sum(axis=1)))
                except:
                    pass
                import ipdb; ipdb.set_trace()

# Wrap the subtensor-fix algorithm in the test harness so that it and the
# reference GradientDescent process every batch and can be compared.
from subtensor_gradient import GradientDescent_SubtensorFix, AdaDelta_SubtensorFix, subtensor_params

lookups = subtensor_params(cg, [encoder.lookup, decoder.sequence_generator.readout.feedback_brick.lookup])
algorithm = GradientDescent_SubtensorFix(
    subtensor_params=lookups,
    cost=cost, params=cg.parameters,
    step_rule=CompositeRule([StepClipping(config['step_clipping']),
                             RemoveNotFinite(0.9),
                             AdaDelta_SubtensorFix(subtensor_params=lookups)])
)
algorithm = GradientDescent_Test(
    modified=algorithm,
    cost=cost, params=cg.parameters,
    step_rule=CompositeRule([StepClipping(config['step_clipping']),
                             RemoveNotFinite(0.9),
                             eval(config['step_rule'])()])
)

With Theano optimizations (and dropout/noise) disabled, the updates are exactly the same, but with optimizations enabled I get differences of around 1e-9.
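One way to switch the graph optimizations off for such an exactness check (a sketch of the standard Theano configuration mechanism, not anything specific to this PR) is either THEANO_FLAGS=optimizer=None on the command line or, before compiling any function:

import theano
theano.config.optimizer = 'None'   # disable graph rewrites; 'fast_run' restores them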

ejls commented 9 years ago

My mistake, I removed this commit. The only difference from the repo I used to get the 10.43 BLEU score on WMT15 fi-en is that I used the following reshuffled dataset:

basedir = '/data/lisatmp3/simonet/wmt15/data/'
config['src_data'] = basedir + 'all.tok.clean.shuf2.seg1.fi-en.fi'
config['trg_data'] = basedir + 'all.tok.clean.shuf2.fi-en.en'

orhanf commented 9 years ago

Okay @ejls, a 20% speed-up is really amazing, thanks for the effort.

rizar commented 9 years ago

Just a note that instead of using a custom LookupFeedbackWMT15, you guys could have a special token to be used as the initial output. That seems even better, since the feedback at the first step would be trainable instead of being zeros.
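A minimal sketch of that idea in plain Theano (the names, sizes and reserved index are hypothetical, and this is not the Blocks API): reserve an index for the <S> token and feed its embedding as the first-step feedback, so that vector is learned rather than being a hard-coded zero vector.

import numpy
import theano
import theano.tensor as T

vocab_size, dim, batch_size, bos_index = 30000, 620, 80, 0   # hypothetical sizes; 0 = <S>
W = theano.shared(numpy.zeros((vocab_size, dim), dtype='float32'), name='lookup')

bos_batch = numpy.full(batch_size, bos_index, dtype='int64')  # one <S> per sequence
first_feedback = W[bos_batch]                                 # (batch, dim); this row is trainable
# a zero-feedback scheme would instead use T.zeros((batch_size, dim)),
# which carries no trainable parameters.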

orhanf commented 9 years ago

@rizar thank you for the pointer. Actually, with the recent changes we are using separate indices for <S> and </S> during training, so we do not need LookupFeedbackWMT15 anymore, as you stated. LookupFeedbackWMT15 was there just to match costs etc. with GroundHog models. I will remove it from the MT example as well.