lucidrains / x-transformers

A simple but complete full-attention transformer with a set of promising experimental features from various papers
MIT License

Question: num_memory_tokens > 0 and return_mems = True #216

Closed · pfeatherstone closed 7 months ago

pfeatherstone commented 7 months ago

I'm investigating XL-recurrence while preserving num_memory_tokens > 0. Looking at the code, it looks like mem is prepended to k and v AFTER the memory tokens have been prepended. By memory tokens, I mean those added through num_memory_tokens > 0, NOT attn_num_mem_kv > 0.

So the sequence going into attention is:

| mems | memory tokens | data |

Is this correct? I would have thought the following would be more correct:

| memory tokens | mems | data |
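For reference, here is roughly the setup I have in mind (hyperparameters are arbitrary; the call pattern follows the XL-recurrence example in the README):

```python
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 512,
    max_mem_len = 512,            # retain XL memories across segments
    num_memory_tokens = 16,       # learned memory tokens prepended to the sequence
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8
    )
)

seg1 = torch.randint(0, 20000, (1, 512))
seg2 = torch.randint(0, 20000, (1, 512))

# first segment returns hidden states to be reused as mems
logits1, mems1 = model(seg1, return_mems = True)

# second segment attends over the previous segment's mems
logits2, mems2 = model(seg2, mems = mems1, return_mems = True)
```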

Cheers

pfeatherstone commented 7 months ago

By the way, I still don't understand the difference between num_memory_tokens > 0 and attn_num_mem_kv > 0. I can see from the code that they are added at different stages: the former early on, the latter inside the attention layer itself, with each attention layer getting its own mem_k and mem_v. Fundamentally, though, I don't see a difference in what they are trying to achieve.
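For concreteness, this is how I understand the two settings are exposed (values arbitrary):

```python
from x_transformers import TransformerWrapper, Decoder

# the former: memory tokens, prepended to the token sequence itself
model_mem_tokens = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    num_memory_tokens = 20,
    attn_layers = Decoder(dim = 512, depth = 6, heads = 8)
)

# the latter: memory key / values, learned per attention layer
model_mem_kv = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_num_mem_kv = 16
    )
)
```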

By the way, this point was discussed in https://github.com/lucidrains/x-transformers/issues/193.

lucidrains commented 7 months ago

memory tokens can also query, and their representation evolves as it goes through the network. their keys and values change with the context

memory key / values are static
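roughly, as a toy single-head sketch (not the actual code in the repo):

```python
import torch
from torch import nn

class ToyAttentionWithMemKV(nn.Module):
    # memory key / values: learned k / v pairs concatenated inside the attention layer.
    # they are plain parameters, not produced from the input, so they stay static
    def __init__(self, dim, num_mem_kv = 4):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3, bias = False)
        self.mem_k = nn.Parameter(torch.randn(num_mem_kv, dim))
        self.mem_v = nn.Parameter(torch.randn(num_mem_kv, dim))

    def forward(self, x):
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        # prepend static memory keys / values - they get attended to, but never query
        k = torch.cat((self.mem_k.expand(x.shape[0], -1, -1), k), dim = 1)
        v = torch.cat((self.mem_v.expand(x.shape[0], -1, -1), v), dim = 1)
        attn = (q @ k.transpose(-1, -2) * q.shape[-1] ** -0.5).softmax(dim = -1)
        return attn @ v

# memory tokens: learned embeddings prepended to the token sequence before the layers,
# so they emit their own queries and their representation is updated layer by layer
mem_tokens = nn.Parameter(torch.randn(20, 512))
x = torch.randn(2, 128, 512)                                # (batch, seq, dim)
x = torch.cat((mem_tokens.expand(2, -1, -1), x), dim = 1)   # (batch, 20 + seq, dim)
# x then flows through every attention / feedforward block as ordinary tokens
```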

lucidrains commented 7 months ago

in my mind, they both address similar issues, but memory tokens are more powerful. memory tokens also only make sense in encoder setups (although i have improvised interspersed memory tokens for causal in the repo, not sure if it works with XL). you should just use memory key / values, 4 should be enough
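untested, but something like this is what i mean, reusing your recurrence setup from above:

```python
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 512,
    max_mem_len = 512,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_num_mem_kv = 4   # memory key / values instead of memory tokens
    )
)

seg1 = torch.randint(0, 20000, (1, 512))
logits1, mems1 = model(seg1, return_mems = True)

seg2 = torch.randint(0, 20000, (1, 512))
logits2, mems2 = model(seg2, mems = mems1, return_mems = True)
```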