Hi, I noticed a mistake in the description of cross attention in 2.3. Attention in LLMs:
Cross Attention: It is used in encoder-decoder architectures, where encoder outputs are the queries, and key-value pairs come from the decoder.
In reality it is the other way around: the queries come from the decoder self-attention, while the encoder outputs act as keys and values (a minimal sketch follows after the quote below). See also here or here, section 3.2.3:
In "encoder-decoder attention" layers, the queries come from the previous decoder layer,
and the memory keys and values come from the output of the encoder.
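To make the direction of the data flow concrete, here is a minimal single-head sketch of that corrected description. The names (`CrossAttention`, `decoder_states`, `encoder_output`) and shapes are my own for illustration, not from the article: queries are projected from the decoder states, while keys and values are projected from the encoder output.

```python
# Minimal single-head cross-attention sketch (illustrative names and shapes):
# queries come from the decoder states, keys/values from the encoder output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)  # projects decoder states -> queries
        self.w_k = nn.Linear(d_model, d_model)  # projects encoder output -> keys
        self.w_v = nn.Linear(d_model, d_model)  # projects encoder output -> values

    def forward(self, decoder_states: torch.Tensor, encoder_output: torch.Tensor) -> torch.Tensor:
        q = self.w_q(decoder_states)   # (batch, tgt_len, d_model)
        k = self.w_k(encoder_output)   # (batch, src_len, d_model)
        v = self.w_v(encoder_output)   # (batch, src_len, d_model)
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5  # (batch, tgt_len, src_len)
        return F.softmax(scores, dim=-1) @ v                  # (batch, tgt_len, d_model)

# Usage: the decoder attends over the encoder's "memory".
attn = CrossAttention(d_model=64)
enc = torch.randn(2, 10, 64)   # encoder output (keys and values)
dec = torch.randn(2, 7, 64)    # decoder self-attention output (queries)
out = attn(dec, enc)           # (2, 7, 64)
```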