peterroelants / peterroelants.github.io

Blog
http://peterroelants.github.io/
Mozilla Public License 2.0
336 stars 181 forks

missing terms in partial derivatives? #6

Closed myhussien closed 8 years ago

myhussien commented 8 years ago

Peter,

Thank you so much for the great RNN tutorial post. This might seem long, but it is very quick.

1 - For Part 1, you defined the states array S to be 1x1. How would your example change if one decided to use 2 hidden states, for example? The clear final solution is that one of them will be turned off, but how would you define it? In this case your wRec will be 2x1, right?

2 - In the same part, section “Compute the gradients with the backward step”, you explain BPTT briefly, and it is not clear to me how you came up with the partial derivatives. I worked out a small 3-time-step example.

Questions:

My Example,

dc/dwx = dc/dy * dy/dwx
dc/dy = 2(y - t)

but y in this example is nothing but (S3 * 1), so:

y = S3
  = x3 * wx + S2 * wRec                                          ... substitute for S2
  = x3 * wx + (x2 * wx + S1 * wRec) * wRec                       ... expand
  = x3 * wx + x2 * wx * wRec + S1 * wRec^2                       ... substitute for S1
  = x3 * wx + x2 * wx * wRec + (x1 * wx + S0 * wRec) * wRec^2   ... expand
  = x3 * wx + x2 * wx * wRec + x1 * wx * wRec^2 + S0 * wRec^3

then,

dy/dwx = x3 + x2 * wRec + x1 * wRec^2 = sum(xi * wRec^(3-i)) where i = {1,2,3}
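The unrolled gradient above can be verified numerically. The following sketch (made-up input values and weights, not the tutorial's code) runs the 3-step recurrence S_k = x_k * wx + S_{k-1} * wRec with S0 = 0, evaluates the unrolled sum for dy/dwx, and checks it against a finite-difference gradient:

```python
import numpy as np

# Hypothetical 3-step example matching the derivation above (values assumed).
x = np.array([1.0, 2.0, 3.0])  # x1, x2, x3
wx, wRec, S0 = 0.5, 0.9, 0.0

# Forward pass through the recurrence S_k = x_k * wx + S_{k-1} * wRec.
S = S0
for xk in x:
    S = xk * wx + S * wRec
y = S  # y = S3

# Unrolled gradient: dy/dwx = x3 + x2*wRec + x1*wRec^2
# (0-indexed here, so the exponent is 2 - i).
dy_dwx = sum(x[i] * wRec ** (2 - i) for i in range(3))

# Finite-difference check: perturb wx and re-run the forward pass.
eps = 1e-6
S = S0
for xk in x:
    S = xk * (wx + eps) + S * wRec
numeric = (S - y) / eps

print(dy_dwx, numeric)  # both approximately 5.61
```

Because y is linear in wx, the finite-difference estimate agrees with the unrolled sum up to floating-point error.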

Best, -M

peterroelants commented 8 years ago

Hi @myhussien ,

To answer your questions:

1 - For Part 1, you defined the states array S to be 1x1. How will your example change if one decided to use 2 hidden states for example.

Part 2 of the tutorial has an example where there are 'multiple states': http://peterroelants.github.io/posts/rnn_implementation_part02/ . The multiplications just become matrix/vector multiplications.
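As a minimal sketch of that generalization (weight values here are assumptions, not the tutorial's): with 2 hidden states, wRec becomes a 2x2 matrix and the state update becomes a vector/matrix multiplication.

```python
import numpy as np

# Scalar recurrence S_k = x_k * wx + S_{k-1} * wRec, generalized to 2 states.
wx = np.array([0.5, -0.3])      # input-to-state weights (assumed values)
wRec = np.array([[0.9, 0.1],
                 [0.0, 0.8]])   # state-to-state weights: now 2x2, not scalar

S = np.zeros(2)                 # initial state fixed to 0
for xk in [1.0, 2.0, 3.0]:      # made-up scalar input sequence
    S = xk * wx + S @ wRec      # same update, now a matrix/vector product
print(S)                        # final 2-dimensional state
```

Note that the first state here only feeds itself (zero in the lower-left of wRec), so its trajectory matches the scalar example with wRec = 0.9.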

In the example you provided for this section, you assume that the weight between the last state and the final output is already given and equals 1, right?

I just defined the output y to be the last state. Implicitly this sets some 'Wy' (which I didn't define) to 1, since 1 is the multiplicative identity. Conceptually it's different from a Wy because I chose not to define it.

Why does your summation start from 0?

Because I fixed the initial state to 0.

Why is there a dc/dSk? (dc: “partial derivative of the cost”)

To illustrate how the gradient propagates through each timestep.
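A hedged sketch of that propagation for the scalar 3-step example above (inputs, target, and weights are made-up values): dc/dS_{k-1} = dc/dS_k * wRec carries the gradient back one timestep, and each step contributes dc/dS_k * x_k and dc/dS_k * S_{k-1} to the weight gradients.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # made-up inputs
t = 1.0                          # made-up target
wx, wRec = 0.5, 0.9

# Forward pass, keeping every state for the backward pass.
S = [0.0]                        # S0 fixed to 0
for xk in x:
    S.append(xk * wx + S[-1] * wRec)
y = S[-1]

# Backward pass (BPTT) for cost c = (y - t)^2.
grad_S = 2 * (y - t)             # dc/dS3
grad_wx, grad_wRec = 0.0, 0.0
for k in range(len(x), 0, -1):
    grad_wx += grad_S * x[k - 1]    # dc/dwx contribution at timestep k
    grad_wRec += grad_S * S[k - 1]  # dc/dwRec contribution at timestep k
    grad_S = grad_S * wRec          # dc/dS_{k-1} = dc/dS_k * wRec
print(grad_wx, grad_wRec)
```

The accumulated grad_wx equals dc/dy times the unrolled sum x3 + x2*wRec + x1*wRec^2 from the earlier derivation.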

Note that you need to understand backpropagation of multilayer neural nets before understanding BPTT.

myhussien commented 8 years ago

Thanks for the comments. I just wanted to note that I know how BPTT works, but your summation starts from 0 and has Xn in it, and there is no X0 defined before it.

I went through your second example, but can you comment on this part:

"how would you arrange the data if you want multi-dimensional input and multi-dimensional output at the same time? For example, each time step has a vector input and a vector output."

Thanks

peterroelants commented 8 years ago

"how would you arrange the data if you want multi-dimensional input and multi-dimensional output at the same time? For example, each time step has a vector input and a vector output."

In part 2 of the tutorial I describe an example with vector input (of size 2). If you want a vector output, all you need to do is transform the state to the output with a matrix that has more than 1 column.
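A minimal sketch of that idea (shapes and the name `Wy` are assumptions for illustration): stack the per-timestep state vectors into a matrix and multiply by an output matrix with one column per output dimension.

```python
import numpy as np

nb_steps, nb_states, nb_out = 5, 3, 2
rng = np.random.default_rng(0)

S = rng.standard_normal((nb_steps, nb_states))   # state vector per timestep
Wy = rng.standard_normal((nb_states, nb_out))    # >1 column -> vector output

Y = S @ Wy
print(Y.shape)  # (5, 2): a 2-dimensional output at each of the 5 timesteps
```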