kyunghyuncho / NMT


Parameter mismatch #1

Closed sebastien-j closed 9 years ago

sebastien-j commented 9 years ago

The number of parameters in the current code doesn't match GroundHog's.

Here, with `cg.parameters`, we get 17 vectors/matrices, whereas GroundHog has 30 for the Encoder-Decoder.

Breakdown by size, with V words and embeddings of dim. 100.

| size | current | groundhog |
| --- | --- | --- |
| V | 1 | 1 |
| V x 100 | 2 | 2 |
| 500 x V | 1 | 0 |
| 100 x V | 0 | 1 |
| 1000 x 1000 | 7 | 12 |
| 100 x 1000 | 5 | 7 |
| 1000 | 1 | 4 |
| 500 x 100 | 0 | 1 |
| 100 | 0 | 2 |
bartvm commented 9 years ago

So there are a few that I can pinpoint:

My guess is then that there was a low-rank approximation used for the output softmax as well, which would mean that the 500 x V matrix should actually be a 100 x V plus a 500 x 100.
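
Roughly, the factorization that would imply (a sketch; the shapes follow the table above, and V here is just an illustrative vocabulary size):

    import numpy as np

    V, d_hid, d_emb = 30000, 500, 100        # illustrative vocabulary; hidden and embedding sizes from the table

    W_full = np.zeros((d_hid, V))            # single full-rank readout matrix: 500 x V
    W_proj = np.zeros((d_hid, d_emb))        # low-rank alternative, part 1: 500 x 100
    W_out = np.zeros((d_emb, V))             # low-rank alternative, part 2: 100 x V

    h = np.zeros(d_hid)                      # state entering the output layer
    scores_full = h.dot(W_full)              # one 500 x V product
    scores_lowrank = h.dot(W_proj).dot(W_out)  # project to 100 dims first, then map to the vocabulary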

That still leaves four 1000 x 1000, two 100 x 1000, two 1000 x 1, and two 100 x 1 vectors unaccounted for. The two 100 x 1 vectors seem strange to me; you could use them as a bias for the low-rank approximation of the output, but that would be redundant because you're not using an activation function.
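
To spell out the redundancy: without a nonlinearity between the two factors, any bias on the 100-dimensional layer simply folds into the next one,

$$ W_2 (W_1 h + b_1) + b_2 = W_2 W_1 h + (W_2 b_1 + b_2), $$

so $b_1$ adds nothing that $b_2$ cannot already express.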

bartvm commented 9 years ago

Added the ones I pinpointed in 2db2871750c77b31e92c39a1626efe355b557e8f

bartvm commented 9 years ago

Okay, asked Cho and figured out where most of the others are coming from. Will have to wrap GatedRecurrent to take in a few extra contexts. Will try to do that later this afternoon or tomorrow.

sebastien-j commented 9 years ago

The 100 x 1 parameters were named b_0_enc_approx_embdr and b_0_dec_approx_embdr.

In GroundHog (encdec.py, around line 569), the embeddings were constructed with MultiLayer using a linear activation function. We did not use `learn_bias=False`. It seems that we were doing a lookup and then adding a bias to all embeddings.
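
Schematically, that amounts to the following (a minimal sketch; `E` and `b` stand in for the W_0_*_approx_embdr lookup table and the b_0_*_approx_embdr bias):

    import numpy as np

    V, d_emb = 30000, 100            # illustrative vocabulary size; embedding dimension from above
    E = np.zeros((V, d_emb))         # lookup table (W_0_*_approx_embdr)
    b = np.zeros(d_emb)              # the extra 100 x 1 bias (b_0_*_approx_embdr)

    word_ids = np.array([5, 42, 7])
    embeddings = E[word_ids] + b     # lookup, then the same bias added to every embedding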

I suppose that we can safely drop these biases.

kyunghyuncho commented 9 years ago

@sebastien-j is correct. Those biases are not really necessary (unless somehow a model finds it important to have a DC component in the embedding) and should be removed in the new implementation.

sebastien-j commented 9 years ago

@kyunghyuncho, there seems to be a small inconsistency between the GroundHog code and the EMNLP paper's supplementary material.

Using the EMNLP paper's notation, it seems like V was removed.

Is the context c equal to h or to tanh(Vh + b)? Obviously, this won't matter for the attention model, but I'd still like to know what's going on.

sebastien-j commented 9 years ago

I have made a few changes to account for the missing 100 x 1000 matrices.

kyunghyuncho commented 9 years ago

@sebastien-j

Don't "initializers" do the latter?

sebastien-j commented 9 years ago

@kyunghyuncho ,

I don't understand what you mean. I am not sure exactly what the code does.

Counting the parameters, there don't seem to be both V and V' from the EMNLP paper. It seems like one of the following (spelled out below):

A) the context is the last hidden state and the initial decoder state is tanh(...), or
B) the context and the initial decoder state are identical, both being tanh(...).
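
In other words, writing $h_{T_x}$ for the last encoder hidden state and $s_0$ for the initial decoder state:

$$ \text{A)}\quad c = h_{T_x}, \qquad s_0 = \tanh(V h_{T_x} + b) $$

$$ \text{B)}\quad c = s_0 = \tanh(V h_{T_x} + b) $$
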
kyunghyuncho commented 9 years ago

@sebastien-j

I'm talking about

    def _create_initialization_layers(self):
        logger.debug("_create_initialization_layers")
        self.initializers = [ZeroLayer()] * self.num_levels
        if self.state['bias_code']:
            for level in range(self.num_levels):
                self.initializers[level] = MultiLayer(
                    self.rng,
                    n_in=self.state['dim'],
                    n_hids=[self.state['dim'] * self.state['hid_mult']],
                    activation=[prefix_lookup(self.state, 'dec', 'activ')],
                    bias_scale=[self.state['bias']],
                    name='{}_initializer_{}'.format(self.prefix, level),
                    **self.default_kwargs)

(in https://github.com/lisa-groundhog/GroundHog/blob/master/experiments/nmt/encdec.py#L847)

Further,

        init_states = given_init_states
        if not init_states:
            init_states = []
            for level in range(self.num_levels):
                init_c = c[0, :, -self.state['dim']:]
                init_states.append(self.initializers[level](init_c))

(in https://github.com/lisa-groundhog/GroundHog/blob/master/experiments/nmt/encdec.py#L1061)
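
Schematically, and assuming the usual tanh activation, each level's initializer amounts to (a minimal numpy sketch, not the GroundHog code itself):

    import numpy as np

    dim = 1000
    c_last = np.zeros(dim)               # the last `dim` units of the context, i.e. c[0, :, -dim:] for one example
    W_init = np.zeros((dim, dim))        # W_0_dec_initializer_0, one of the 1000 x 1000 matrices
    b_init = np.zeros(dim)

    s0 = np.tanh(c_last.dot(W_init) + b_init)   # initial decoder state when state['bias_code'] is set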

bartvm commented 9 years ago

I've now got this:

| size | current | groundhog |
| --- | --- | --- |
| V | 1 | 1 |
| V x 100 | 2 | 2 |
| 100 x V | 1 | 1 |
| 1000 x 1000 | 10 | 12 |
| 100 x 1000 | 7 | 7 |
| 1000 | 4 | 4 |
| 500 x 100 | 1 | 1 |
| 100 | 0 | 2 |

The 100-dimensional vectors were redundant, but I'm wondering where the 1000 x 1000 discrepancy comes from. The ones I can think of:

- 6 transition matrices (encoder, decoder each have state-to-state, state-to-reset and state-to-update)
- last hidden state encoder to initial decoder state
- last hidden state encoder to transition context
- last hidden state encoder to readout context (i.e. to the maxout layer)
- decoder hidden state to readout

@kyunghyuncho @sebastien-j Which two matrices am I missing?

Also, where do you use biases? Technically speaking I guess I could have 6 for the inputs in both the decoder and encoder, plus one for the encoder state to decoder initial state, and one for the maxout input (decoder hidden state to readout). That gives a total of 8, twice what GroundHog has... I guess that you don't use biases for the update and reset gates? Why not?

sebastien-j commented 9 years ago

Right now, I encounter an error when running the code, `AsTensorError: ('Cannot convert [merge_apply_states, merge_apply_feedback, merge_apply_readout_context] to TensorType', <type 'list'>)`, so it may be harder for me to pinpoint what is missing.

Here are the 12 1000 x 1000 matrices from GroundHog, with their names and usage:

| Usage | Count | Names |
| --- | --- | --- |
| Encoder RNN transition | 3 | `{W, G, R}_enc_transition_0` |
| Decoder initialization | 1 | `W_0_dec_initializer_0` |
| Decoder RNN transition | 3 | `{W, G, R}_dec_transition_0` |
| Decoder RNN context | 3 | `W_0_dec_dec_{inputter, updater, resetter}_0` |
| Readout | 2 | `W_0_dec_{hid, repr}_readout` |

Looking at the parameter names, there are indeed no biases for the update and reset gates.

bartvm commented 9 years ago

You'll need to update your copy of the readout branch of Blocks; I made another change there that prevents that error.

Okay, that solves the mystery: there is a separate context transformation for the inputs, the reset gates and the update gates, whereas I only had one for the inputs. I'll add the other two and then I think it's an exact match.
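
For reference, this is the gated recurrent transition with a separate context projection per gate (standard notation; the three C matrices correspond to the dec_dec_{inputter, updater, resetter} parameters listed above):

$$ r_t = \sigma(W_r x_t + U_r h_{t-1} + C_r c) $$

$$ z_t = \sigma(W_z x_t + U_z h_{t-1} + C_z c) $$

$$ \tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}) + C c) $$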

bartvm commented 9 years ago

Done in 4834ef17572052b66fb912debd0af22765ba9c59

kyunghyuncho commented 9 years ago

> The 100-dimensional vectors were redundant, but I'm wondering where the 1000 x 1000 discrepancy comes from. [...] @kyunghyuncho @sebastien-j Which two matrices am I missing?

I can't really think of any, but I'll take a look at it tomorrow or a day after when I'm back in Montreal. Just remind me at some point.

> Also, where do you use biases? [...] I guess that you don't use biases for the update and reset gates? Why not?

Blame my laziness when I was coding up GatedRecurrent in GroundHog.

sebastien-j commented 9 years ago

Thanks Bart.

Now that we have all the parameters, the next sanity check would probably be to set up one Blocks model and one GroundHog model with the same parameter values, and verify that the cost is identical for a few training examples.

Looking at the logs, I think that here we display the average cost per token instead of the total cost.

For the encoder-decoder model, was there an EOS token on the source side?

kyunghyuncho commented 9 years ago

> For the encoder-decoder model, was there an EOS token on the source side?

Yes.

bartvm commented 9 years ago

Ah, I don't have one for the source side right now; I'll add that one.

I monitor the cost per token as well.

sebastien-j commented 9 years ago

With the latest changes (cb3b001d69bfc884cb3029dccccf73e5e9fbba92), the number of parameters has increased to 33.

Some of them are 1000 x 1 biases. If these are biases for the update and reset gates (I am not sure), we can possibly leave them there. There is an additional 1000 x 1000 matrix that I cannot account for.

@bartvm, how does `**merge(...)` work?

The current code only works with recent versions of Theano; otherwise I get dimension mismatch errors. Do you know which changes in Theano cause this?

bartvm commented 9 years ago

I added the biases to the update and reset gates, yeah, since @kyunghyuncho said that he had probably just forgotten them. The additional 1000 x 1000 matrix shouldn't be there, though; let me have a look.

`**` is the "splat operator" (at least that's what it's called in Ruby and Perl), which unpacks a set of keyword arguments from a dictionary, i.e. `f(a=1, b=2)` is the same as `f(**{'a': 1, 'b': 2})`. The `merge` function is from the Toolz package and basically just combines dictionaries.
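
For example (a minimal sketch; the dictionaries here are made up):

    from toolz import merge

    defaults = {'a': 1, 'b': 2}
    overrides = {'b': 3, 'c': 4}

    kwargs = merge(defaults, overrides)   # {'a': 1, 'b': 3, 'c': 4}; later dictionaries take precedence

    def f(a, b, c):
        return a + b + c

    f(**kwargs)                           # equivalent to f(a=1, b=3, c=4)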

The Theano change is https://github.com/Theano/Theano/pull/2602, which fixed a bug that prevented the Merge brick from working.

bartvm commented 9 years ago

The extra parameter came from the fact that I used `fork` to transform `readout_context`. However, `Readout` automatically applies a linear transformation before merging it with the other context parameters, so it was being transformed twice. There was also one bias missing (right before the maxout network). All the parameters should be correct again now.