snipsco / ntm-lasagne

Neural Turing Machines library in Theano with Lasagne
https://medium.com/snips-ai/ntm-lasagne-a-library-for-neural-turing-machines-in-lasagne-2cdce6837315#.63t84s5r5
MIT License

Learning the initial weights of the heads leads to NaN #2

Open tristandeleu opened 8 years ago

tristandeleu commented 8 years ago

When setting the `learn_init=True` parameter on the heads, the error and the parameters of the heads become NaN after some number of iterations (not necessarily the first one; it can happen after 100+ iterations).

How to reproduce it:

```python
heads = [
    WriteHead([controller, memory], shifts=(-1, 1), name='write', learn_init=True),
    ReadHead([controller, memory], shifts=(-1, 1), name='read', learn_init=True)
]
```

This is a non-blocking issue, since learning these weights may not actually make sense (we can just keep the equiprobable initialization as the first step).

tristandeleu commented 8 years ago

This may be due to the norm constraint on the weights (and the initial weights) being violated during training. The weights are required to sum to one, but the vanilla training procedure does not enforce this constraint. Enforcing it is critical, because an unconstrained update can push some entries of `w_tilde` negative, which explains the NaNs in the sharpening step `w ∝ w_tilde ** gamma`.
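For concreteness, here is a minimal numpy sketch of that failure mode, with made-up values (not taken from the library): a single negative entry in `w_tilde` is enough to turn the whole sharpened weighting into NaNs.

```python
import numpy as np

# Hypothetical weighting after an unconstrained gradient update:
# the entries no longer stay non-negative / on the simplex.
w_tilde = np.array([0.4, 0.7, -0.1])
gamma = 1.5  # sharpening exponent (gamma >= 1)

sharpened = w_tilde ** gamma     # (-0.1) ** 1.5 -> nan for non-integer gamma
w = sharpened / sharpened.sum()  # the NaN propagates through the normalization
print(w)                         # [nan nan nan]
```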

tristandeleu commented 8 years ago

Learning the initial weights might be something we'll eventually need. Initializing them to a uniform probability over all the addresses almost necessarily forces the first step to write in a distributed way (over multiple addresses, instead of hard addressing).

Instead of learning the raw `weight_init`, which may have some issues as explained in https://github.com/snipsco/nlp-neural-turing-machine/issues/2#issuecomment-141499219, we could learn some kind of initialization that goes through a normalization step to produce `w_0`. The process would be to learn `weight_init` (keeping this shared variable as a parameter) and then get the first weight as

`w_0 = normalize(rectify(weight_init))`

The additional `rectify()` nonlinearity favors sparse initializations.
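A rough Theano/Lasagne sketch of that parametrization (an illustration under assumed shapes and names, not the library's actual code) could look like this:

```python
import numpy as np
import theano
from lasagne.nonlinearities import rectify

num_slots = 128  # hypothetical number of memory addresses

# Learn this shared variable directly; its values are unconstrained.
weight_init = theano.shared(
    np.random.uniform(-0.1, 0.1, size=(num_slots,)).astype(theano.config.floatX),
    name='weight_init')

# Rectify to favor sparse initializations, then renormalize onto the simplex.
rectified = rectify(weight_init)
w_0 = rectified / (rectified.sum() + 1e-6)  # epsilon guards against everything being rectified to zero
```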

EderSantana commented 8 years ago

Hi @tristandeleu, when learning the initial weights, are you making sure they sit behind a softmax? In other words, are you learning the initial logits instead? If so, there is no problem if they take negative values. I had this problem in my NTM implementation as well.
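For reference, a small Theano sketch of that idea (names and shapes are assumptions, not the library's API): learn unconstrained logits and apply a softmax on top, so the resulting initial weighting is always non-negative and sums to one, whatever the gradient updates do.

```python
import numpy as np
import theano
import theano.tensor as T

num_slots = 128  # hypothetical number of memory addresses

# Unconstrained logits; negative values are perfectly fine here.
logits_init = theano.shared(
    np.zeros((num_slots,), dtype=theano.config.floatX),
    name='logits_init')

# T.nnet.softmax expects a 2D input, so lift to a 1 x num_slots row and index back.
w_0 = T.nnet.softmax(logits_init.dimshuffle('x', 0))[0]
```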

tristandeleu commented 8 years ago

When I originally opened this issue I wasn't, which was a mistake on my end. I haven't tried to learn the logits, but you're right, I think this is the right solution (I only sketched the idea in this issue). In the end I left `learn_init=False` for the weights in my experiments and initialized them as one-hot vectors. But I haven't found a good way to support both fixing the initialization (eg. with `OneHot`) and learning the logits.
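One way to see why combining the two is awkward (a plain-numpy illustration, not a proposed fix for the library): a softmax over finite logits can only approximate a one-hot vector, it never reaches it exactly.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

num_slots = 8                 # small hypothetical memory for readability
logits = np.zeros(num_slots)
logits[0] = 10.0              # large logit on the first address

print(softmax(logits))        # ~[0.9997, 4.5e-05, ...]; close to one-hot, never exactly [1, 0, ...]
```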

EderSantana commented 8 years ago

I'm new to your codebase; could you point me to where the initial weights are defined? I could try to check it out.

tristandeleu commented 8 years ago

The initial weights are defined here: https://github.com/snipsco/ntm-lasagne/blob/master/ntm/heads.py#L102. But for now, unfortunately, there's no correct way to learn these weights.