snipsco / ntm-lasagne

Neural Turing Machines library in Theano with Lasagne
https://medium.com/snips-ai/ntm-lasagne-a-library-for-neural-turing-machines-in-lasagne-2cdce6837315#.63t84s5r5
MIT License

Dynamic N-Grams task #12

Open tristandeleu opened 8 years ago

tristandeleu commented 8 years ago

Dynamic N-Grams task

I will gather all the progress on the Dynamic N-Grams task in this issue. I will (hopefully) update it regularly, so you may want to unsubscribe if you don't want to get all the spam.

tristandeleu commented 8 years ago

Context length mismatch

I trained the NTM on the full Dynamic N-Grams task. Training takes a lot longer than on the previous tasks (maybe because the input sequences are longer than usual). As in the original paper, I trained the NTM on length-200 binary inputs sampled from a 6-gram look-up table; this look-up table was itself sampled from a Beta(1/2, 1/2) distribution.

[Figure: ngrams-06-fail] Left: write weights. Middle: read weights. Right, from top to bottom: the input sequence, the Bayesian optimum as computed in the original paper, the prediction from the NTM.
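For reference, here is a minimal sketch of how such inputs can be generated (illustrative only, not the actual task generator from the library): a 6-gram look-up table drawn from Beta(1/2, 1/2), giving P(next bit = 1 | previous 5 bits) for each context, and a length-200 binary sequence sampled from it.

```python
import numpy as np

def sample_ngram_table(n=6, rng=np.random):
    # One Beta(1/2, 1/2) probability P(next bit = 1 | context) for each
    # of the 2**(n-1) binary contexts of length n-1.
    return rng.beta(0.5, 0.5, size=2 ** (n - 1))

def sample_sequence(table, length=200, n=6, rng=np.random):
    bits = list(rng.randint(0, 2, size=n - 1))  # random initial context
    while len(bits) < length:
        # Interpret the last n-1 bits as an index into the look-up table.
        context = int(''.join(str(b) for b in bits[-(n - 1):]), 2)
        bits.append(int(rng.random_sample() < table[context]))
    return np.array(bits, dtype=np.int8)

table = sample_ngram_table()
sequence = sample_sequence(table)  # length-200 binary input
```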

The results are not very good but show an interesting behavior of the NTM. The model actually managed to keep track of some context, which on closer inspection corresponds not to a 6-gram model but rather to a 4-gram model (contexts of length 3 instead of 5). This may be due to an initial look-up table that is "degenerate", where for a fixed length-3 context only 1 out of the 4 length-5 contexts has a significant probability. This seems to be confirmed by shorter inputs.

[Figure: ngrams-03-fail] Here the Bayesian optimum is computed on contexts of length 3 instead of 5, and the predictions appear to be more similar.
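For context, the Bayesian optimum shown in these plots is the optimal estimator under the Beta(1/2, 1/2) prior: P(next bit = 1 | context) = (N1 + 1/2) / (N0 + N1 + 1), where N0 and N1 count the zeros and ones that have followed that context so far in the sequence. A minimal sketch (illustrative, not the code used to produce the plots):

```python
import numpy as np

def bayesian_optimum(sequence, n=6):
    context_len = n - 1
    counts = np.zeros((2 ** context_len, 2))  # per-context counts [N0, N1]
    predictions = np.full(len(sequence), 0.5)
    for t in range(context_len, len(sequence)):
        context = int(''.join(str(b) for b in sequence[t - context_len:t]), 2)
        n0, n1 = counts[context]
        predictions[t] = (n1 + 0.5) / (n0 + n1 + 1.0)
        counts[context, int(sequence[t])] += 1
    return predictions
```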

In this previous example, an analysis of the read weights suggests that certain memory locations correspond to specific length-3 contexts:

| Location in the memory | Context |
| --- | --- |
| ~75 | 011 |
| ~22 | 101 |
| ~109 & ~110 | 111 |
| ~104 & ~34 | 110 |

An issue with the current model is that the NTM does not write anything to memory and only relies on what it reads from memory. The challenge here is that both heads have to be "in sync": the model has to read from and write to the same locations in memory, which means that both heads have to independently figure out where the contextual information is stored. Maybe we can improve that by tying the parameters of both heads.
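In Lasagne, tying parameters is just a matter of reusing the same shared variables in two layers. A rough sketch of the idea on two hypothetical key-emitting layers (illustrative only, not the actual ntm-lasagne head API):

```python
import lasagne
from lasagne.layers import InputLayer, DenseLayer

# Controller hidden state (the size 100 is just for the example).
controller = InputLayer((None, 100))

# Key emitted by the read head.
read_key = DenseLayer(controller, num_units=20,
                      nonlinearity=lasagne.nonlinearities.rectify)

# The write head reuses the exact same weights, so for a given controller
# state both heads produce the same key and address the same locations.
write_key = DenseLayer(controller, num_units=20,
                       W=read_key.W, b=read_key.b,
                       nonlinearity=lasagne.nonlinearities.rectify)
```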

Parameters of the experiment

I used a setup similar to the ones used in the previous tasks. I only added sign parameters (with a linear activation clipped to [-1, 1]) for key (on both heads) and add (on the write head), to allow positive and negative values for the elements in memory while maintaining the sparsity provided by the rectify activation function. In other words, I replaced key -> sign * key and add -> sign * add.
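A minimal sketch of the key -> sign * key substitution (variable names are illustrative, not the library's internals):

```python
import theano.tensor as T
from lasagne.nonlinearities import rectify

def signed_key(h, W_key, b_key, W_sign, b_sign):
    # Standard key: rectified, hence non-negative and sparse.
    key = rectify(T.dot(h, W_key) + b_key)
    # Sign parameter: linear activation clipped to [-1, 1].
    sign = T.clip(T.dot(h, W_sign) + b_sign, -1.0, 1.0)
    # The effective key can take negative values while keeping the
    # sparsity induced by the rectifier.
    return sign * key
```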