hscspring opened this issue 8 months ago (status: Open)
I also have the same question. Here is the commit that added this code. @flu0r1ne Can you please explain more on this? Any help would be appreciated.
It allows the underlying model's KV cache to be maintained between interleaved messages. Prior to this change, the KV cache had to be re-computed between each message (though not within the auto-regressive loop). The generation code in this repository does not use this property, but the recomputation was a bug in the wrapper for the underlying model. If you hook the model yourself (as I was doing), you can achieve a speed-up. See #899.
Here is the code in model.py (line 482)
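For readers without the file open: the section in question (paraphrased here, not quoted verbatim; the original uses torch, and `seqlen`, `start_pos`, and `h` come from the surrounding `forward` method) builds a causal mask over the new tokens and then prepends `start_pos` columns of zeros for the cached positions. A numpy sketch of the same logic:

```python
import numpy as np

def build_attn_mask(seqlen, start_pos):
    """Numpy stand-in for the mask construction in model.py.

    Rows index the `seqlen` new query tokens; columns index all
    `start_pos + seqlen` key positions (cached + new).
    """
    if seqlen <= 1:
        # Single-token decoding steps attend to the whole cache; no mask needed.
        return None
    # (seqlen, seqlen) causal part: -inf strictly above the diagonal.
    mask = np.full((seqlen, seqlen), float("-inf"))
    mask = np.triu(mask, k=1)
    # Prepend zeros for the `start_pos` cached positions, so every new
    # token may attend to the entire cache.
    return np.hstack([np.zeros((seqlen, start_pos)), mask])
```

With `start_pos=0` the `hstack` prepends an empty block, which is exactly the no-op observed below.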
Except for the prompt input, each subsequently generated token is fed in one at a time (seqlen=1). That means this mask operation is only used for the first input (the prompt), where `start_pos` is always zero, so the `hstack` operation here effectively does nothing. Does anyone know what its effect is supposed to be?
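To make flu0r1ne's point concrete: the `hstack` only matters when `forward` is re-entered with a multi-token chunk *and* a non-empty cache, e.g. appending a second message of several tokens with `start_pos > 0`. The repository's own generation loop never does this, but a caller hooking the model directly can. A minimal pure-Python illustration (my own sketch of one mask row, where `-inf` marks a blocked position and `0.0` an attendable one):

```python
NEG_INF = float("-inf")

def mask_row(i, seqlen, start_pos):
    # Row for new-token i: zeros over the start_pos cached positions
    # (always attendable), then a causal pattern over the seqlen new ones.
    cached = [0.0] * start_pos
    causal = [0.0 if j <= i else NEG_INF for j in range(seqlen)]
    return cached + causal

# Prompt step: start_pos=0, so the prepended part is empty (the no-op
# this issue observes).
print(mask_row(0, seqlen=3, start_pos=0))  # [0.0, -inf, -inf]

# Follow-up multi-token message reusing the cache: start_pos=3, seqlen=2.
# The first new token may attend to all 3 cached positions, but not to
# the new token after it.
print(mask_row(0, seqlen=2, start_pos=3))  # [0.0, 0.0, 0.0, 0.0, -inf]
print(mask_row(1, seqlen=2, start_pos=3))  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

So with `start_pos = 0` the zero block is empty and the operation is indeed a no-op; it only changes the mask when the cache is reused for a multi-token input.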