snipsco / ntm-lasagne

Neural Turing Machines library in Theano with Lasagne
https://medium.com/snips-ai/ntm-lasagne-a-library-for-neural-turing-machines-in-lasagne-2cdce6837315#.63t84s5r5
MIT License

Dynamic N-Grams task #12

Open tristandeleu opened 8 years ago

tristandeleu commented 8 years ago

Dynamic N-Grams task

I will gather all the progress on the Dynamic N-Grams task in this issue. I will (hopefully) update it regularly, so you may want to unsubscribe if you don't want to get all the spam.

tristandeleu commented 8 years ago

Context length mismatch

I trained the NTM on the full Dynamic N-Grams task. Training takes a lot longer than on the previous tasks (maybe because the input sequences are longer than usual). As in the original paper, I trained the NTM on length-200 binary inputs sampled from a 6-gram look-up table; this look-up table was itself sampled from a Beta(1/2, 1/2) distribution.

[Figure: ngrams-06-fail] Left: write weights. Middle: read weights. Right, from top to bottom: the input sequence, the Bayesian optimum as computed in the original paper, the prediction from the NTM.
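For reference, here is a minimal sketch of how such inputs can be generated (illustrative only, not the actual task generator from the library): a 6-gram look-up table drawn from Beta(1/2, 1/2), giving P(next bit = 1 | previous 5 bits) for each context, and a length-200 binary sequence sampled from it.

```python
import numpy as np

def sample_ngram_table(n=6, rng=np.random):
    # One Beta(1/2, 1/2) probability P(next bit = 1 | context) for each
    # of the 2**(n-1) binary contexts of length n-1.
    return rng.beta(0.5, 0.5, size=2 ** (n - 1))

def sample_sequence(table, length=200, n=6, rng=np.random):
    bits = list(rng.randint(0, 2, size=n - 1))  # random initial context
    while len(bits) < length:
        # Interpret the last n-1 bits as an index into the look-up table.
        context = int(''.join(str(b) for b in bits[-(n - 1):]), 2)
        bits.append(int(rng.random_sample() < table[context]))
    return np.array(bits, dtype=np.int8)

table = sample_ngram_table()
sequence = sample_sequence(table)  # length-200 binary input
```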

The results are not very good but show an interesting behavior of the NTM. The model actually managed to keep track of some context, which on closer inspection corresponds not to a 6-gram model but rather to a 4-gram model (contexts of length 3 instead of 5). This may be due to an initial look-up table that is "degenerate", where for a fixed length-3 context only 1 out of the 4 length-5 contexts has a significant probability. This seems to be confirmed by shorter inputs.

[Figure: ngrams-03-fail] Here the Bayesian optimum is computed on contexts of length 3 instead of 5, and the predictions appear to be more similar.
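For context, the Bayesian optimum shown in these plots is the optimal estimator under the Beta(1/2, 1/2) prior: P(next bit = 1 | context) = (N1 + 1/2) / (N0 + N1 + 1), where N0 and N1 count the zeros and ones that have followed that context so far in the sequence. A minimal sketch (illustrative, not the code used to produce the plots):

```python
import numpy as np

def bayesian_optimum(sequence, n=6):
    context_len = n - 1
    counts = np.zeros((2 ** context_len, 2))  # per-context counts [N0, N1]
    predictions = np.full(len(sequence), 0.5)
    for t in range(context_len, len(sequence)):
        context = int(''.join(str(b) for b in sequence[t - context_len:t]), 2)
        n0, n1 = counts[context]
        predictions[t] = (n1 + 0.5) / (n0 + n1 + 1.0)
        counts[context, int(sequence[t])] += 1
    return predictions
```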

In this previous example, an analysis of the read weights suggests that certain memory locations correspond to specific length-3 contexts:

| Location in the memory | Context |
| --- | --- |
| ~75 | 011 |
| ~22 | 101 |
| ~109 & ~110 | 111 |
| ~104 & ~34 | 110 |

An issue with the current model is that the NTM does not write anything to memory and only relies on what it reads from memory. The challenge here is that both heads have to be "in sync": the model has to read from and write to the same locations in memory, which means that both heads have to independently figure out where the contextual information is stored. Maybe we can improve that by tying the parameters of both heads.
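In Lasagne, tying parameters is just a matter of reusing the same shared variables in two layers. A rough sketch of the idea on two hypothetical key-emitting layers (illustrative only, not the actual ntm-lasagne head API):

```python
import lasagne
from lasagne.layers import InputLayer, DenseLayer

# Controller hidden state (the size 100 is just for the example).
controller = InputLayer((None, 100))

# Key emitted by the read head.
read_key = DenseLayer(controller, num_units=20,
                      nonlinearity=lasagne.nonlinearities.rectify)

# The write head reuses the exact same weights, so for a given controller
# state both heads produce the same key and address the same locations.
write_key = DenseLayer(controller, num_units=20,
                       W=read_key.W, b=read_key.b,
                       nonlinearity=lasagne.nonlinearities.rectify)
```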

Parameters of the experiment

I used a setup similar to the ones used in the previous tasks. I only added sign parameters (with a linear activation clipped to [-1, 1]) for key (on both heads) and add (on the write head), to allow positive and negative values for the elements in memory while maintaining the sparsity provided by the rectify activation function. In other words, I replaced key -> sign * key and add -> sign * add.
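A minimal sketch of the key -> sign * key substitution (variable names are illustrative, not the library's internals):

```python
import theano.tensor as T
from lasagne.nonlinearities import rectify

def signed_key(h, W_key, b_key, W_sign, b_sign):
    # Standard key: rectified, hence non-negative and sparse.
    key = rectify(T.dot(h, W_key) + b_key)
    # Sign parameter: linear activation clipped to [-1, 1].
    sign = T.clip(T.dot(h, W_sign) + b_sign, -1.0, 1.0)
    # The effective key can take negative values while keeping the
    # sparsity induced by the rectifier.
    return sign * key
```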