mila-iqia / blocks

A Theano framework for building and training neural networks

Initial feedback should be (optionally) zero #502

Open bartvm opened 9 years ago

bartvm commented 9 years ago

CC @sebastien-j

Right now the code in BaseSequenceGenerator.cost() sets the initial feedback as a function of readout.initial_outputs(). When using a SoftmaxEmitter this means that the initial feedback is basically the word embedding for the token with index 0. There are two problems with this:

rizar commented 9 years ago

It's very easy to fix in the cost method, but slightly harder in generate. The only 100% solution I can think of is to track the step number and not use the feedback when it is zero.

Or another way: it seems reasonable to me to say that initial_outputs of a SoftmaxEmitter should return -1. Then the LookupFeedback can return zeros when its input is negative. This behavior is a bit shady, but it can be activated with a flag, in which case it seems okay to me.

I vote for the second option; let me know if you have other ideas.
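
For illustration, a minimal standalone sketch of the second option (plain Theano, not the actual Blocks API; the helper name and the masking trick are assumptions):

import theano.tensor as tensor

def lookup_feedback_with_negatives(outputs, lookup_table):
    # Hypothetical helper: return embeddings for non-negative indices, and an
    # all-zero vector wherever an index is negative (e.g. the proposed -1).
    safe_indices = tensor.maximum(outputs, 0)
    embeddings = lookup_table[safe_indices]
    mask = tensor.ge(outputs, 0).astype(embeddings.dtype)
    return embeddings * mask[:, None]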

kyunghyuncho commented 9 years ago

What I usually do when generating from these types of models is

    # if it's the first word, emb should be all zero
    emb = tensor.switch(y[:,None] < 0, tensor.alloc(0., 1, tparams['Wemb_dec'].shape[1]), 
                        tparams['Wemb_dec'][y])

As @rizar said, this should be handled in the lookup module, which should in any case only work with non-negative integers.

bartvm commented 9 years ago

But these solutions effectively hardcode the assumption that we're outputting labels. What's wrong with just replacing these lines with something like zeros_like(...)?

kyunghyuncho commented 9 years ago

How about a kind of strict mode at LookupFeedback?

If the input to LookupFeedback is a negative integer, an error is raised (the default). If the LookupFeedback instance is instead initialized with, say, strict=False, it simply returns tensor.alloc(0., outputs_shape[0], self.num_outputs).

After all, LookupFeedback is defined to work with an integer input:

A feedback brick for the case when readout are integers

https://github.com/bartvm/blocks/blob/8667c7c52d390405bbb93701a3517bafcc579cc2/blocks/bricks/sequence_generators.py#L499

bartvm commented 9 years ago

I don't follow. My point is that we can hack away at LookupFeedback and Lookup, but that ignores the fact that people might be using a completely different type of feedback. To properly solve it, we need to change the behavior of BaseSequenceGenerator directly.

kyunghyuncho commented 9 years ago

But, isn't this very task-specific?

bartvm commented 9 years ago

How so? My targets could be real numbers, and the feedback I receive from them is that same number plus 10. The current code would give me 10s as initial feedback, while I just want zeros.

For the BaseSequenceGenerator.cost method this seems like it would be a more general solution.

initial_feedback = self.readout.feedback(self.readout.initial_outputs(
        batch_size, **contexts))
if zero_init_feedback:
    initial_feedback = tensor.zeros_like(initial_feedback)
feedback = tensor.set_subtensor(feedback[0], initial_feedback)

I have to say that I just started looking at the BaseSequenceGenerator.generate method and the BeamSearch class, but neither has any docstrings or documentation that makes it clear to me how to use them. The tests for BeamSearch are very convoluted as well, because they depend on the word reversal example, which has a custom generate method, instead of being a self-contained test case. I'll need to look a bit more into the SequenceGenerator generation framework before it's entirely clear to me how to use these things, and by extension, how the initial feedback problem applies here.

kyunghyuncho commented 9 years ago

But then you would have to re-work the generator separately. If you change LookupFeedback to return all-zero vectors (when strict=False) and change initial_outputs here to return -1 instead of 0, this should work for both cost and generate simultaneously, no?

bartvm commented 9 years ago

SoftmaxEmitter is used, so you would need to change these lines instead. And yes, then it would work for both generation and cost calculation, but it would only work when using SoftmaxEmitter combined with LookupFeedback, not for any other kinds of readout or feedback, like the case I mentioned. It would also not work if you're applying any sort of transformation (with bias) on the target word embeddings before feeding them back in. Hence, I'm not saying it won't work, I'm just saying it's a hack, and I think we should find a more proper solution if possible.

kyunghyuncho commented 9 years ago

Right, I agree that there may be a more proper solution. It's just that I'm afraid this will be rather task-specific.

That being said, we can always reserve the index 0 for the initial symbol.

bartvm commented 9 years ago

I think that would be the best solution to use in most cases, because it's probably more powerful to learn the initial feedback than to give all zeros anyway. For reproducing other work (like the GroundHog translation model) I guess it'd be nice to have a solution though. If we don't want to spend time on this, we can just hack it in our own Blocks branch for NMT, but I'd rather not merge that.

rizar commented 9 years ago

Actually there is no solution for all cases; on this I agree with Cho. The sequence generator is an abstract class that does not directly solve your problem, but allows you to solve it.

From the viewpoint of the sequence generator there is such a thing as initial outputs, which it expects from readout.initial_outputs. Then readout.feedback knows how to use those. That is, the problem is entirely transparent to it; these are all internal intricacies of the Readout component. So BaseSequenceGenerator should not be blamed for the fact that you cannot currently do what you want.

The readout component itself contains the emitter and the feedback brick as subcomponents. It is the interaction of emitter.initial_outputs and feedback_brick.feedback that defines what sort of feedback (or its absence) is used to generate the first output. So this whole question should be solved by changing the behaviour of these two methods.

Currently there is an arbitrary assumption in SoftmaxEmitter that the initial output is always zero. This assumption does not have any good motivation; the code just needs continuous improvement. I propose to make it configurable by a constructor argument, so that it could be set to -1, for instance. This can go into Blocks master very quickly.

Then the LookupFeedback currently assumes that all its inputs are valid indices in the lookup table. To be honest I am not sure if we should change this behavior, because it is very clear and simple.

So what about just creating a LookupFeedback with num_outputs one more than needed and using this new index, num_outputs - 1, as the initial output? The lookup tables from the old translation models would have to be appended with one zero row, but I guess that is not a big issue.
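
For instance, a toy illustration of extending an embedding matrix with one reserved zero row (the sizes here are arbitrary, not taken from the actual models):

import numpy

old_embeddings = numpy.random.randn(30000, 620).astype('float32')
# Append one all-zero row, reserved for the initial output.
new_embeddings = numpy.vstack(
    [old_embeddings, numpy.zeros((1, old_embeddings.shape[1]), dtype='float32')])
initial_output_index = new_embeddings.shape[0] - 1  # == num_outputs - 1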

I am sorry that the code is not documented: all my time is eaten by some other things...

bartvm commented 9 years ago

Actually there is no solution for all cases; on this I agree with Cho.

I really don't see why this is true. Perhaps your code doesn't allow for an easy solution because of the assumptions it makes, but in general it shouldn't be true, because letting the first feedback consist of zeros is not specific to labels, softmaxes, etc. My (artificial) example of a case where your outputs are real-valued shows this clearly. Hence, I don't see why we should necessarily move this general behavior of "ignore the first feedback" into the emitter and feedback components. These components are generally concerned with providing a sensible feedback from outputs. Since there are no outputs, it is normal that they don't know what to do. So there are multiple options:

My suggestion is the more general one, which will work for all possible combinations of emitter and feedback bricks, because it sets the feedback to zero where we really (might) want it to be zero: right before being conditioned on.

kyunghyuncho commented 9 years ago

@bartvm

Is it easy (or even possible) to code SequenceGenerator to ignore any preceding block when it's the first input?

Meanwhile, @sebastien-j and I talked about this offline. We decided to, for now, wrap SoftmaxEmitter and LookupFeedback in NMT temporarily to work this around. This will give us more time before we decide what should be done with Blocks.

bartvm commented 9 years ago

@kyunghyuncho The code I have does exactly that for the SequenceGenerator.cost method

initial_feedback = self.readout.feedback(self.readout.initial_outputs(
        batch_size, **contexts))
if zero_init_feedback:
    initial_feedback = tensor.zeros_like(initial_feedback)
feedback = tensor.set_subtensor(feedback[0], initial_feedback)

I'm just not sure how SequenceGenerator.generate fits in here.

As a temporary workaround I would follow @rizar's suggestion: just append/prepend the target word embedding matrix with a row of zeros, and make SoftmaxEmitter return dim or 0 (depending on whether the row is appended or prepended) by default. It should be a really easy fix.

kyunghyuncho commented 9 years ago

I meant for both cost and generate, but yes, we'll make a temporary workaround in NMT for now.

bartvm commented 9 years ago

For generate I'm not sure if there's an easy way, but I guess one option would be to split the current generate method into two: one application (e.g. generate_all) which doesn't take outputs and uses a zero vector (or the initial_outputs, like the code in my previous comment) to generate the first output, and then calls the current generate method with that output to produce the rest of the sequence.

rizar commented 9 years ago

@bartvm, I agreed with you that it would be possible to make the feedback always zero at the first step. I do not agree, though, that this is always a desirable solution. I would say that having this feedback as a trainable parameter is a sensible solution, and if one does not use the zero index for one's symbols, this is the default behaviour.

For this reason I think there is no universal solution at the sequence generator level. As you said, this is a framework, not a plug-and-play solution.

I looked at your code with zero_init_feedback and it looks good, but I don't know an easy way to do the same trick in generate. I would not like to hardcode in generate a particular value for fake previous outputs. I think that feedback bricks are the places for such things. Splitting it into two methods would do the job, but significantly increases the code complexity. If we really decide to support zero_init_feedback at the sequence generator level, I would keep track of the current iteration in generate and when it is zero, not use feedback from the feedback brick.

@kyunghyuncho, when you have your workaround implemented, can you please give me a link so that I can put in my 2 cents?

rizar commented 9 years ago

@kyunghyuncho , would #511 help you?

bartvm commented 9 years ago

I would not like to hardcode in generate a particular value for fake previous outputs

But I'm not hardcoding anything! I'm simply letting zero_init_feedback be a keyword argument which lets the user decide whether to ignore the first feedback (set it to zeros) or to use the initial outputs from the emitter (which would make the initial feedback trainable). As such it allows for both your solution and mine.

Splitting it into two methods would do the job, but significantly increases the code complexity. If we really decide to support zero_init_feedback at the sequence generator level, I would keep track of the current iteration in generate and when it is zero, not use feedback from the feedback brick.

But at least it's a proper solution. What really increases code complexity in the long term is hacking things together by adding special negative indices for lookup values IMHO.

rizar commented 9 years ago

But I'm not hardcoding anything!

At some point you said "if we are hardcoding 0 as a beginning-of-sentence token, we need to make this very clear in the documentation" and I was addressing that, saying that I do not think we should have it at the SequenceGenerator level.

I think I know a way to implement zero_init_feedback without making the code more complex. Let's support an iteration_numbers=True option in the recurrent decorator; it might not be the only case where this is needed. If this option is given, an additional sequential input can be given to theano.scan, which will go in the first position, and the generate signature will change to def generate(iteration_number, **kwargs). I guess it is clear how to use this iteration number in the code later. What do you think?
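
A standalone sketch of the underlying mechanism (plain theano.scan, not the actual @recurrent implementation; the step logic is a made-up example):

import theano
import theano.tensor as tensor

n_steps = tensor.lscalar('n_steps')
initial_state = tensor.vector('initial_state')

def step(iteration_number, state):
    # keep is 0 on the first iteration and 1 afterwards, so the update is
    # skipped on the very first step without using ifelse or switch.
    keep = tensor.cast(1 - tensor.eq(iteration_number, 0), state.dtype)
    return state + keep

# Passing tensor.arange(n_steps) as a sequence makes the iteration number the
# first argument of the inner function, which is what the proposed option
# would arrange automatically.
states, _ = theano.scan(step,
                        sequences=tensor.arange(n_steps),
                        outputs_info=[initial_state])
f = theano.function([n_steps, initial_state], states)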

bartvm commented 9 years ago

I thought about this, but then you need to use Theano's ifelse or switch op in order to behave differently when iteration_number is 0, right? I would avoid that if at all possible, because they can be slow, cause optimization issues, etc.

At some point before you said "if we are hardcoding 0 as a beginning-of-sentence token, we need to make this very clear in the documentation" and was I addressing this, saying that I do not think we should have it at the SequenceGenerator level.

I still don't really follow. I was pointing out that 0 is currently hardcoded, and that this needs to be documented. I was then pointing out that I am not suggesting hardcoding anything into the SequenceGenerator, just suggesting that the user needs to have two options: rely on a default value from the emitter, or ignore the feedback entirely. The latter needs to be managed at the SequenceGenerator level.

rizar commented 9 years ago

I think one can do it without ifelse or switch: tensor.eq(iteration_number, 0) and 1 - tensor.eq(iteration_number, 0) give you the coefficients for a linear combination of a zero vector and the feedback vector.

Zero is hardcoded in SoftmaxEmitter, and it has to be documented there, right. I agree with all the rest as well; it seems there was just some misunderstanding, I thought you sort of wanted to push this zero to the sequence generator level.

bartvm commented 9 years ago

That sounds good, so we end up with something like this if I understand your suggestion correctly:

@recurrent(iteration_numbers=True)
def generate(self, iteration_number, outputs, **kwargs):
    ...
    feedback = self.readout.feedback(outputs)
    if zero_init_feedback:
        feedback = (1 - tensor.eq(iteration_number, 0)) * feedback
    next_readouts = self.readout.readout(
        feedback=feedback, **dict_union(states, next_glimpses, contexts))

I just don't know how you want to get zero_init_feedback into generate. There's no way of passing non-Theano variable arguments to scan functions that I know of.

rizar commented 9 years ago

Your code looks good.

I think zero_init_feedback should be a flag of BaseSequenceGenerator, because it's weird to pass it twice, to both the cost and generate methods.
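
A minimal sketch of that idea, with a simplified and assumed constructor signature (not the actual BaseSequenceGenerator API):

class SequenceGeneratorSketch(object):
    def __init__(self, readout, transition, zero_init_feedback=False):
        # Storing the flag on the instance lets both cost() and generate()
        # read self.zero_init_feedback instead of taking it as an argument.
        self.readout = readout
        self.transition = transition
        self.zero_init_feedback = zero_init_feedback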

bartvm commented 9 years ago

Right, that also solves the problem of having access to zero_init_feedback if it's an attribute of self.

rizar commented 9 years ago

Then I think the design phase is done :)