By the way, I still don't understand the difference between `num_memory_tokens > 0` and `attn_num_mem_kv > 0`. I can see from the code that they are added at different stages, the former early on, the latter in the attention layer specifically, and each attention layer gets its own `mem_k` and `mem_v`. However, fundamentally, I don't see the difference in what they are trying to achieve.
By the way, this point was discussed in https://github.com/lucidrains/x-transformers/issues/193.
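To keep the two options straight, here is a minimal sketch of how each one is turned on, following the usual x-transformers constructor style (the specific hyperparameter values are placeholders, not taken from this thread):

```python
import torch
from x_transformers import TransformerWrapper, Encoder

# Option 1: memory tokens -- learned tokens prepended to the input sequence,
# enabled on the wrapper via `num_memory_tokens`.
model_mem_tokens = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    num_memory_tokens = 16,
    attn_layers = Encoder(dim = 512, depth = 6, heads = 8),
)

# Option 2: memory key / values -- learned k/v pairs added inside every
# attention layer, enabled via `attn_num_mem_kv` on the attn_layers.
model_mem_kv = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Encoder(dim = 512, depth = 6, heads = 8, attn_num_mem_kv = 4),
)

x = torch.randint(0, 20000, (1, 1024))
logits_a = model_mem_tokens(x)   # memory tokens are stripped before the logits
logits_b = model_mem_kv(x)
```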
memory tokens can also query, and their representation evolves as it goes through the network. the keys and values they produce change with the context
memory key / values are static
in my mind, they both address similar issues, but memory tokens are more powerful. memory tokens also only make sense in encoder setups (although i have improvised interspersed memory tokens for causal in the repo, not sure if it works with XL). you should just use memory key / values, 4 should be enough
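A simplified sketch of the distinction described above, with shapes collapsed to a single head (this is an illustration of the idea, not the actual x-transformers internals):

```python
import torch

b, n, d, num_mem = 2, 128, 512, 4

# Memory tokens: learned embeddings concatenated onto the *sequence* before the
# attention stack. They also act as queries, so their state is updated layer by
# layer and the keys/values they yield depend on the current input.
memory_tokens = torch.nn.Parameter(torch.randn(num_mem, d))
x = torch.randn(b, n, d)
x = torch.cat((memory_tokens.expand(b, -1, -1), x), dim = 1)  # (b, num_mem + n, d)
# ... x now flows through every layer; memory token states evolve with the context

# Memory key / values: learned k/v pairs concatenated inside one attention layer.
# They never query and are not updated during the forward pass, so the same
# static key/value content is offered to every input.
mem_k = torch.nn.Parameter(torch.randn(num_mem, d))
mem_v = torch.nn.Parameter(torch.randn(num_mem, d))
k = torch.randn(b, n, d)
v = torch.randn(b, n, d)
k = torch.cat((mem_k.expand(b, -1, -1), k), dim = 1)  # (b, num_mem + n, d)
v = torch.cat((mem_v.expand(b, -1, -1), v), dim = 1)
```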
I'm investigating XL-recurrence while preserving `num_memory_tokens > 0`. Looking at the code, it looks like `mems` is prepended to `k` and `v` AFTER memory tokens have been prepended. By memory tokens, I mean those added through `num_memory_tokens > 0`, NOT `attn_num_mem_kv > 0`. The sequence going into attention is:

| mems | memory tokens | data |

Is this correct? I would have thought the following would be more correct:

| memory tokens | mems | data |

Cheers