Closed · maojiaqi111 closed this issue 3 months ago
Hi, did you remove the transformer's K, Q, V entirely and replace them with the TTT-defined K, Q, V?
Hi @maojiaqi111
Is it possible to provide some code to reproduce your error? Or did you try to directly train the provided TTT model instead of modifying self-attention?
Hi @xvjiarui, thanks for your question, +1. I also wonder whether it is feasible to load the k, q, v weights from a pretrained transformer (a standard transformer's self-attention, not the k, q, v in ttt.linear) directly, or does one have to map each key to ttt.linear manually? If the latter, I guess the mapping would require very delicate effort and be susceptible to errors?
Thank you!
@LuoyaoChen Hi. The QKV in a transformer are analogous to those in TTT. You should be able to load them directly, and the shapes should match (see page 7 of the paper).
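If it helps, here is a minimal sketch of what that weight transfer could look like, assuming both the pretrained attention module and the TTT layer expose `nn.Linear` projections named `q_proj`, `k_proj`, `v_proj` (these attribute names are assumptions, not the actual API; check the module definitions in `ttt.py` and in your pretrained transformer before using this):

```python
import torch

def copy_qkv_from_attention(attn_layer, ttt_layer):
    """Copy pretrained Q/K/V projection weights into a TTT layer,
    assuming both sides use nn.Linear projections of the same shape."""
    with torch.no_grad():
        for src, dst in [
            (attn_layer.q_proj, ttt_layer.q_proj),
            (attn_layer.k_proj, ttt_layer.k_proj),
            (attn_layer.v_proj, ttt_layer.v_proj),
        ]:
            assert src.weight.shape == dst.weight.shape, "projection shapes must match"
            dst.weight.copy_(src.weight)
            if src.bias is not None and dst.bias is not None:
                dst.bias.copy_(src.bias)
```

Whether the remaining keys (output projection, norms, the TTT inner weights) map one-to-one would still need to be checked against the TTT layer's `state_dict()`.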
Hello,
I have replaced a model's self-attention with `TTTLinear` using the code provided. When calling `loss.backward()` on the first batch, there are no issues. However, during `loss.backward()` on the second batch, I encounter the following error:

It seems to be related to some intermediate values in the `ttt_layer` being freed, or to how its gradients are computed. Since I'm not very familiar with the internal gradient calculations of `ttt_layer`, I cannot pinpoint the faulty code. Do you have any insights or suggestions on how to resolve this issue?
Thank you!
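For reference, this symptom (the first backward succeeds, the second fails over freed intermediate values) is usually a sign that a tensor produced in batch 1 is still part of the autograd graph when batch 2's loss is built, for example a hidden or inner state that the layer carries across forward calls. Below is a self-contained toy that reproduces the error with a made-up stateful layer (not the actual `TTTLinear` code) and shows the common workaround of detaching the carried state between batches:

```python
import torch
import torch.nn as nn

# Toy stand-in (NOT the actual TTTLinear code) for a layer that carries a
# state tensor across forward calls, similar in spirit to a test-time-training
# layer updating its inner state on every batch.
class StatefulLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.state = None  # persists across batches

    def forward(self, x):
        if self.state is None:
            self.state = torch.zeros_like(x)
        # The new state depends on the current batch's autograd graph.
        self.state = self.state + self.proj(x)
        return self.state

layer = StatefulLayer(8)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)

for step in range(2):
    x = torch.randn(4, 8)
    loss = layer(x).pow(2).mean()
    # Without the detach below, this call fails on the SECOND batch with
    # "Trying to backward through the graph a second time", because self.state
    # still references intermediate values freed by the first backward().
    loss.backward()
    opt.step()
    opt.zero_grad()

    # Workaround: cut the graph so the next batch does not backprop into this one.
    layer.state = layer.state.detach()
```

If `TTTLinear` (or the cache object passed to it) keeps a comparable state across batches, detaching or re-initializing it after each optimizer step is the usual remedy; passing `retain_graph=True` to `backward()` would only mask the problem and grow memory over time.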