lmjohns3 / theanets

Neural network toolkit for Python
http://theanets.rtfd.org
MIT License

The current clockwork RNN implementation #125

Closed ypxie closed 8 years ago

ypxie commented 8 years ago

http://theanets.readthedocs.org/en/v0.6.2/generated/theanets.layers.recurrent.Clockwork.html

The fast module should receive information from the slow modules, which have larger periods, and not vice versa.

But the current implementation does the opposite.

ypxie commented 8 years ago

@lmjohns3

githubnemo commented 8 years ago

It would have been helpful if you had included how you came to this conclusion, so I will try to reconstruct it here.

For a CWRNN with 3 modules and 1 node per module, the mask currently is:

1 1 1
0 1 1
0 0 1

Columns are ordered left to right from slow to fast. Assuming this to be the weight matrix, simulating an activation vector gives

dot([1 1 1], [[1 1 1], [0 1 1], [0 0 1]]) = [1,2,3]

So the leftmost unit receives the least input. The reverse happens when using the alternative mask:

dot([1 1 1], [[1 0 0], [1 1 0], [1 1 1]]) = [3,2,1]

So yes, it seems to me that the shortcut connections are reversed.
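The two dot products above can be checked with NumPy. This is only a sketch of the arithmetic in this comment: it treats the mask itself as the weight matrix and uses an all-ones activation vector, exactly as above. The variable names are made up for illustration and are not taken from theanets.

```python
import numpy as np

# All-ones activation vector for 3 modules with 1 node each.
h = np.ones(3)

# Current mask: rows [1 1 1], [0 1 1], [0 0 1].
mask_current = np.triu(np.ones((3, 3)))
# Alternative mask: rows [1 0 0], [1 1 0], [1 1 1].
mask_alternative = np.tril(np.ones((3, 3)))

# Entry k of each result counts how many units feed into unit k.
inputs_current = h @ mask_current
inputs_alternative = h @ mask_alternative

print(inputs_current)      # number of incoming connections per unit, current mask
print(inputs_alternative)  # same, alternative mask
```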

The patch is (probably) as simple as

-mask[i*n:(i+1)*n, i*n:] = 1
+mask[ i*n:, i*n:(i+1)*n] = 1
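
To see what the two assignments produce, here is a minimal sketch that builds the full mask both ways. Only the two slice assignments come from the diff above; the function name `build_mask`, its arguments, and the surrounding loop are assumptions made for illustration and may not match the actual theanets code.

```python
import numpy as np

def build_mask(n_modules, n, patched=False):
    """Build a clockwork mask for n_modules modules of n nodes each.

    Hypothetical helper: only the two slice assignments are from the diff.
    """
    size = n_modules * n
    mask = np.zeros((size, size))
    for i in range(n_modules):
        if patched:
            # Proposed line: column block i gets ones from row i*n downward.
            mask[i*n:, i*n:(i+1)*n] = 1
        else:
            # Current line: row block i gets ones from column i*n rightward.
            mask[i*n:(i+1)*n, i*n:] = 1
    return mask

print(build_mask(3, 1))                # upper-triangular mask
print(build_mask(3, 1, patched=True))  # lower-triangular mask
```

Under these assumptions the patched mask is simply the transpose of the current one, for any module size `n`.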

lmjohns3 commented 8 years ago

Yeah, this can be tricky to reason about. I've read your argument carefully but I think the current code is correct; let me see if I can convince you. The code assumes, as I will here, that the modules are sorted in decreasing order of period, so the slowest one is on the left and the fastest is on the right.

Symbolically, the hidden state can be represented as

[ h3  h2  h1 ]

for three modules, each with one node, that have periods 3, 2, and 1, respectively. The masked weight matrix is:

[ w33 w32 w31 ]
[  0  w22 w21 ]
[  0   0  w11 ]

So after the dot product, we'd get the vector:

[ h3*w33  h3*w32+h2*w22  h3*w31+h2*w21+h1*w11 ]

Again, the modules are sorted from slowest on the left (period 3) to fastest on the right (period 1). The symbolic variables show that the hidden state from the slowest module is incorporated in the output of all three modules, but the fastest module only affects the output of itself.
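
The same argument can be checked numerically. This is a sketch with arbitrary weight values (the numbers 1 through 9 are placeholders, not real weights): it perturbs the slowest and the fastest hidden unit in turn and records which outputs change.

```python
import numpy as np

# Masked weight matrix with the zero pattern shown above; values are arbitrary.
W = np.triu(np.arange(1.0, 10.0).reshape(3, 3))

def step(h):
    # One masked recurrent dot product, with the hidden state as a row vector.
    return h @ W

base = np.array([1.0, 1.0, 1.0])                # [h3, h2, h1], slowest first
bump_slow = base + np.array([1.0, 0.0, 0.0])    # perturb h3 (period 3)
bump_fast = base + np.array([0.0, 0.0, 1.0])    # perturb h1 (period 1)

# Boolean vectors marking which outputs react to each perturbation.
changed_by_slow = step(bump_slow) != step(base)
changed_by_fast = step(bump_fast) != step(base)

print(changed_by_slow)  # outputs affected by the slowest module
print(changed_by_fast)  # outputs affected by the fastest module
```

With this mask, changing h3 moves all three outputs, while changing h1 moves only the last one.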

At any rate I agree that there could be a couple more comments in the code along these lines. I'll try to adapt this response and include it in the mask generation somewhere.

I'm going to close this but feel free to reopen if you're not convinced by this line of reasoning.

ypxie commented 8 years ago

Thanks for the explanation. I did not notice that the periods are reversed in the init function. Is there any specific reason to reverse them?

githubnemo commented 8 years ago

@lmjohns3 thanks for taking the time to look at this. Instead of fixing this rather non-intuitive behaviour with more documentation, I would argue that, considering how easy the fix is, the mask should be the other way around, allowing for ascending period specification. There is really no reason to make things more complicated by introducing caveats like this.

talpay commented 8 years ago

In your examples, the period-vector is passed with increasing numbers (1,2,4). If your reasoning is correct, then it has to be (4,2,1) which makes the examples incorrect.

I really think it makes sense to follow the convention of the original paper by Koutnik. It's the behavior I'd expect when having read the paper and then coming here to try the code.

lmjohns3 commented 8 years ago

> In your examples, the period-vector is passed with increasing numbers (1,2,4). If your reasoning is correct, then it has to be (4,2,1) which makes the examples incorrect.

The periods were sorted internally in descending order. I didn't want there to be some weird bug where a user could add periods in arbitrary order and let that prevent the layer from working.
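
As a rough illustration of that internal sort (the helper name `normalize_periods` is hypothetical and is not the actual theanets code), the idea is that any user-supplied order ends up in the same descending sequence:

```python
def normalize_periods(periods):
    """Return the periods sorted from slowest (largest) to fastest (smallest).

    Hypothetical helper illustrating the descending sort described above.
    """
    return tuple(sorted(periods, reverse=True))

# Increasing or arbitrary input order yields the same internal order.
print(normalize_periods((1, 2, 4)))
print(normalize_periods([2, 4, 1]))
```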

At any rate I've reversed the sorted order of the periods and the order of the mask elements since it appears multiple people have the same mental model of how the layer ought to work.
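
For anyone checking that the two conventions are equivalent, here is a small sketch. It assumes that "reversing the order of the mask elements" means flipping the mask along both axes, and it uses the periods (4, 2, 1) from the examples mentioned above; the `senders` helper is made up for this illustration.

```python
import numpy as np

# Old convention: periods descending, upper-triangular mask.
periods_desc = (4, 2, 1)
mask_desc = np.triu(np.ones((3, 3)))

# New convention (assumed): periods ascending, mask flipped along both axes.
periods_asc = periods_desc[::-1]
mask_asc = mask_desc[::-1, ::-1]

def senders(periods, mask):
    """Map each module's period to the set of periods it receives input from.

    With the hidden state as a row vector, mask[row, col] == 1 means that
    module `row` sends to module `col`.
    """
    count = len(periods)
    return {
        periods[col]: {periods[row] for row in range(count) if mask[row, col]}
        for col in range(count)
    }

print(senders(periods_desc, mask_desc))
print(senders(periods_asc, mask_asc))
```

Both calls describe the same connectivity: the period-1 module receives from all modules, and the period-4 module receives only from itself.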