thunlp / InfLLM

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
MIT License

Implementation details of Representative Score computation and Memory Lookup? #45

Closed Becomebright closed 1 month ago

Becomebright commented 1 month ago

Hi, the description of the Representative Score and Memory Lookup in Section 3.2 of the paper seems to treat a token's query/key as a single vector. In the actual implementation, how are the queries/keys of multi-layer, multi-head attention handled?

guyan364 commented 1 month ago

The attention of each layer is processed independently; in the implementation, each layer uses a ContextManager in place of the original kv cache. When computing the similarity between queries and keys, no distinction is made between heads: you can think of it as summing the attention logits of all heads, so the lookup unit is a block of shape (block_size, hidden_size). You can set perhead to enable a separate lookup for each head, but this is not recommended. After the lookup, the normal multi-head attention computation is performed.
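
Below is a minimal sketch (in PyTorch, not the repository's ContextManager code) of the head-agnostic lookup described above: per-head query/key logits are summed over heads, which is equivalent to a single dot product over the concatenated hidden_size dimension, and the same top-k blocks are then used for every head. The tensor shapes, the function name, and the reduction from per-query scores to one score per block are illustrative assumptions.

```python
import torch

def lookup_blocks(query, repr_keys, topk):
    """Head-agnostic block lookup (illustrative sketch).

    query:     (n_heads, q_len, head_dim)             queries of the current chunk
    repr_keys: (n_blocks, n_repr, n_heads, head_dim)  representative keys of each memory block
    Returns the indices of the top-k blocks, shared by all heads.
    """
    # Per-head similarity logits between queries and representative keys.
    logits = torch.einsum("hqd,brhd->hqbr", query, repr_keys)
    # No distinction between heads: sum the logits of all heads, which equals
    # a dot product over the full hidden_size = n_heads * head_dim dimension.
    logits = logits.sum(dim=0)                    # (q_len, n_blocks, n_repr)
    # Reduce to one relevance score per block (the exact reduction over
    # queries and representative tokens here is an assumption).
    block_scores = logits.mean(dim=(0, 2))        # (n_blocks,)
    # The selected blocks are loaded for all heads; normal multi-head
    # attention is then computed over them plus the local context.
    return block_scores.topk(topk).indices
```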

Becomebright commented 1 month ago

Thanks for the answer!

I have two more questions:

  1. Since each layer is processed independently, a block may have different representative tokens in different layers, and different layers may retrieve different blocks during lookup. Is my understanding correct?
  2. Why not aggregate the results of all layers and perform a single lookup?

guyan364 commented 1 month ago
  1. The phrase "a block across layers" may differ slightly from the implementation: you can think of it as each layer's attention maintaining its own memory units, so a block belongs to a specific layer's attention.
  2. Because the current query must first be computed and then compared with the representative vectors to perform the lookup, and each layer's query depends on the computation of the previous layer. Doing a single unified lookup across all layers would be closer to RAG. (See the sketch below.)
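
For concreteness, here is a hedged sketch of the per-layer structure described in point 1: every attention layer keeps its own store of memory units (blocks) together with their representative keys, so block boundaries and representative tokens are maintained per layer. The class and attribute names are illustrative and do not mirror the repository's ContextManager API.

```python
import torch

class LayerMemory:
    """Per-layer memory store (illustrative; one instance per attention layer)."""

    def __init__(self, block_size, n_repr):
        self.block_size = block_size
        self.n_repr = n_repr
        self.blocks = []  # each entry: {"k": keys, "v": values, "repr_k": representative keys}

    def add_block(self, k, v, repr_scores):
        """Store a finished block of this layer's past keys/values.

        k, v:        (n_heads, block_size, head_dim)
        repr_scores: (block_size,) representative scores of the block's tokens
        """
        repr_idx = repr_scores.topk(self.n_repr).indices
        self.blocks.append({"k": k, "v": v, "repr_k": k[:, repr_idx]})

# One independent memory per layer: layer i's lookup uses layer i's queries,
# which depend on layer i-1's output, so a single unified lookup across all
# layers would behave more like retrieval-augmented generation (RAG).
num_layers = 32  # e.g., a 32-layer model
memories = [LayerMemory(block_size=128, n_repr=4) for _ in range(num_layers)]
```
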
Becomebright commented 1 month ago

Thanks!