ita9naiwa / attention-impl

attention implementation

Speculative Decoding #8

Open ita9naiwa opened 2 months ago

ita9naiwa commented 2 months ago

To implement speculative decoding, the LLM has to run multiple query positions against a shared prefix (the kv_cache). This means we need an attention path that handles both the cached keys/values and the new queries.


https://github.com/ita9naiwa/attention-impl/blob/25f00214763dbccdc859e194c66ca51771814b0c/packed_attention_kernel.cu#L425-L434

It takes queries q_1, ..., q_l, keys k_1, ..., k_l, values v_1, ..., v_l, and a kv_cache of size n, and computes: [Attention(q_1, concat(k_cache, k_1), concat(v_cache, v_1)), Attention(q_2, concat(k_cache, k_1, k_2), concat(v_cache, v_1, v_2)), ..., Attention(q_l, concat(k_cache, k_1, ..., k_l), concat(v_cache, v_1, ..., v_l))]. In other words, each query q_i attends to the full cache plus the new keys/values up to position i. A reference sketch is below.
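
For clarity, here is a minimal PyTorch reference sketch of that computation, not the repo's CUDA kernel: the function name, argument names, and shapes (single head, no batching) are assumptions for illustration only. It is equivalent to causal-masked attention over concat(cache, new keys), where every query sees the whole cache but only its causal prefix of the new keys.

```python
# Reference sketch only (assumed names/shapes), not packed_attention_kernel.cu.
import math
import torch

def spec_decode_attention(q, k_new, v_new, k_cache, v_cache):
    """q, k_new, v_new: [l, d]; k_cache, v_cache: [n, d].
    Query i attends to the full cache plus new positions 1..i."""
    l, d = q.shape
    n = k_cache.shape[0]
    k = torch.cat([k_cache, k_new], dim=0)           # [n + l, d]
    v = torch.cat([v_cache, v_new], dim=0)           # [n + l, d]
    scores = q @ k.T / math.sqrt(d)                  # [l, n + l]
    # Cache positions are always visible; new positions are causally masked
    # so that query q_i only sees k_1 ... k_i.
    mask = torch.ones(l, n + l, dtype=torch.bool)
    mask[:, n:] = torch.tril(torch.ones(l, l, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v         # [l, d]

# Quick check with random tensors:
l, n, d = 4, 16, 64
out = spec_decode_attention(torch.randn(l, d), torch.randn(l, d),
                            torch.randn(l, d), torch.randn(n, d),
                            torch.randn(n, d))
print(out.shape)  # torch.Size([4, 64])
```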

So now I can embark on implementing Speculative decoding logic itself!