Hi, thank you for your question.
The section you are looking at discusses the update rule for a hidden state which is a K-layer MLP. Therefore, k here refers to the k-th layer in this MLP, and W^{k} means the weight matrix of that layer.
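To make the indexing concrete, here is a minimal toy sketch (not our actual code; sizes are hypothetical):

```python
# Toy sketch: the hidden state is itself a K-layer MLP, so W^k here is just
# the weight matrix of the k-th layer of that MLP (sizes are hypothetical).
import numpy as np

K, d = 2, 4
rng = np.random.default_rng(0)
hidden_state = {k: rng.normal(size=(d, d)) for k in range(1, K + 1)}
W1, W2 = hidden_state[1], hidden_state[2]   # W^1 and W^2 in the notation above
```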
Thanks for your reply. The weight matrix W is updated during the forward pass: from W_0, W_1, W_2, ... to W_b. So which W does this formula refer to? W_0? W_1? W_b? Looking forward to your reply. Thanks!
Say $\theta_0$ refers to the full set of parameters of the hidden state (a neural network in our case). Then $\theta_1, \dots, \theta_b$ are the updated parameter sets after each TTT mini-batch, which I think is what you are referring to.
On the other hand, $W^k$ means the weight of the k-th layer inside this hidden state (a neural network), so $\theta_t = \{W_t^1, \dots, W_t^K\}$.
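A tiny sketch of the two index roles (again a toy; the random "gradients" below are placeholders for the real gradient step on the TTT self-supervised loss):

```python
# Toy sketch of the two index roles: theta[t] is the full parameter set after
# t TTT mini-batches, and each theta[t] is a list of per-layer weights
# [W_t^1, ..., W_t^K]. The update below is a random placeholder; the real one
# is a gradient step on the TTT self-supervised loss.
import numpy as np

K, d, b = 2, 4, 3
rng = np.random.default_rng(0)
theta = [[rng.normal(size=(d, d)) for _ in range(K)]]   # theta[0] = theta_0

for t in range(b):
    placeholder_grads = [rng.normal(size=(d, d)) for _ in range(K)]
    theta.append([Wk - 0.1 * g for Wk, g in zip(theta[-1], placeholder_grads)])

# theta[t][k - 1] is W_t^k: the weight of layer k after the t-th mini-batch.
```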
Thanks for your reply. So when we calculate the output Z, which W do we use (the W in the red box)? W_0? W_1? W_b?
Say at time step b (the b-th TTT mini-batch), the hidden state value is $\theta_b = \{W_b^1, \dots, W_b^K\}$.
The final output Z_b at time step b is calculated by forwarding through the K layers of $\theta_b$ sequentially, i.e., computing $Z_b^1$ through $Z_b^K$.
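In toy code (tanh is a stand-in nonlinearity and the shapes are hypothetical, not the actual implementation):

```python
# Toy sketch: Z_b comes from pushing the b-th mini-batch through the K layers
# of theta_b in order, i.e., computing Z_b^1, ..., Z_b^K.
import numpy as np

def output_at_step_b(X_b, theta_b):
    """X_b: (d, n) tokens of the b-th mini-batch; theta_b: [W_b^1, ..., W_b^K]."""
    Z = X_b
    for W_bk in theta_b:          # Z_b^k = tanh(W_b^k @ Z_b^{k-1})
        Z = np.tanh(W_bk @ Z)
    return Z                      # the final output Z_b = Z_b^K

d, n, K = 4, 3, 2
rng = np.random.default_rng(0)
Z_b = output_at_step_b(rng.normal(size=(d, n)),
                       [rng.normal(size=(d, d)) for _ in range(K)])
```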
Your confusion may come from the formula itself: it's not specific to a time step, since it applies to any time step.
Thanks for your reply! The sentence "it's not specific to a time step, since it applies to any time step" clears it up for me. Thanks for your patience.
You are very welcome!
But another question: why does the W here refer to W_0? Is it specific to a time step? Looking forward to your reply. Thanks!
Here W_0 stands for the initial value of the hidden state for an arbitrary time step. Here 1, ..., b refer to the 1st to b-th tokens inside an arbitrary TTT mini-batch. W_1, ..., W_b stand for the updated hidden state values after each token in that TTT mini-batch.
Therefore, it's not specific to a time step.
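To make these indices concrete, here is a minimal sketch with a single linear layer and an l2 loss as a stand-in for the actual self-supervised loss (my own toy setup, not the repo's code); note that every gradient inside the mini-batch is taken at the same initial value W_0:

```python
# Toy sketch (linear hidden state, l2 loss): l(W; x) = ||W x - x||^2 / 2,
# so grad_{W_0} l = (W_0 x - x) x^T. Inside one mini-batch, every gradient
# is taken at the same initial value W_0.
import numpy as np

d, b, eta = 4, 3, 0.1
rng = np.random.default_rng(0)
X = rng.normal(size=(b, d))              # the b tokens of one TTT mini-batch
W0 = rng.normal(size=(d, d))             # W_0: initial hidden state value

grads = [np.outer(W0 @ x - x, x) for x in X]   # grad_{W_0} l for each token
W = [W0]                                        # W[t] holds W_t
for t in range(1, b + 1):
    W.append(W0 - eta * sum(grads[:t]))         # W_t = W_0 - eta * sum_{s<=t} grad_s
# W[1], ..., W[b] are the updated hidden states for tokens 1, ..., b; nothing
# here depends on which mini-batch we are in.
```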
Thank you for your reply. May I ask: when we calculate the update, why do we convert the gradient of the loss with respect to W_0 into the gradient with respect to Z?
Sorry I didn't get your question, could you point me to the specific formula you are referring to?
When calculating the updated W, we use the gradient of the loss with respect to W_0, but why is the gradient of the loss with respect to Z used in the subsequent formulas? Why is such a conversion necessary?
$\nabla_{Z^k} l$ comes from the first factor of $\nabla_{W_0^k} l = \nabla_{Z^k} l \cdot \left(\hat{X}^k\right)^T$. The point of the dual form is that, by applying the associative rule of matrix multiplication, we can directly calculate the output of the second forward pass $\bar{Z}$ without explicitly materializing $\nabla_{W_0^k} l$.
Please refer to Appendix A.3 for more details.
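To illustrate in the simplest case, a single linear layer with a toy l2 loss $l(W; x) = \|Wx - x\|^2 / 2$ (a stand-in I am using for this sketch, not the actual TTT loss): here $\nabla_{W_0} l = g_s x_s^T$ with $g_s = \nabla_{z} l = W_0 x_s - x_s$, and the dual form regroups $(g_s x_s^T)\, x_t = g_s\, (x_s^T x_t)$:

```python
# Toy sketch of the dual form (single linear layer, l2 loss; not the repo's
# code). Primal: materialize each grad_{W_0} l = g_s x_s^T as a d x d matrix.
# Dual: regroup (g_s x_s^T) x_t = g_s (x_s^T x_t), so only token inner
# products are needed and no d x d gradient is ever formed.
import numpy as np

d, b, eta = 4, 3, 0.1
rng = np.random.default_rng(0)
X = rng.normal(size=(d, b))          # columns x_1, ..., x_b of one mini-batch
W0 = rng.normal(size=(d, d))

G = W0 @ X - X                       # column s is g_s = grad_{Z} l at token s

# Primal: explicit weights W_t = W_0 - eta * sum_{s<=t} g_s x_s^T, then W_t x_t
Zbar_primal = np.empty_like(X)
for t in range(b):
    Wt = W0 - eta * sum(np.outer(G[:, s], X[:, s]) for s in range(t + 1))
    Zbar_primal[:, t] = Wt @ X[:, t]

# Dual: A[s, t] = x_s^T x_t for s <= t (np.triu zeroes the s > t entries),
# so Zbar = W_0 X - eta * G A, with no d x d gradient matrix materialized
A = np.triu(X.T @ X)
Zbar_dual = W0 @ X - eta * G @ A

assert np.allclose(Zbar_primal, Zbar_dual)
```

The primal loop materializes $d \times d$ outer products; the dual form only ever touches $W_0 X$, $G$, and the $b \times b$ Gram matrix $X^T X$, which is exactly the saving that associativity buys.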
Hello author, thank you very much for your work. What does this W refer to? Is it W_0 or W_b? I don't understand. Looking forward to your reply, thank you!