Open icoderzqliu opened 5 days ago
Hi,
Thank you for your interest in our work.
Thanks for your question regarding the identification process during the prefill stage. I have commented out the lines related to this and plan to update the documentation shortly.
The unnecessary cache is discarded during the first step of decoding, eliminating the need for further discarding in subsequent operations.
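To illustrate the one-time discard described above, here is a minimal sketch in plain Python. It is not the repository's code: `LayerCache`, `compress`, the `compressed` flag, and the `keep_initial` / `keep_recent` values are hypothetical names and parameters chosen for illustration, and integers stand in for the key/value tensors.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LayerCache:
    # Hypothetical stand-in for one layer's KV cache: integers replace key/value tensors.
    entries: List[int] = field(default_factory=list)
    compressed: bool = False  # assumed one-shot flag, not taken from the repository


def compress(cache: LayerCache, keep_initial: int = 4, keep_recent: int = 8) -> None:
    # Assumed policy: keep the first few and the most recent entries, drop the middle.
    if len(cache.entries) > keep_initial + keep_recent:
        cache.entries = cache.entries[:keep_initial] + cache.entries[-keep_recent:]
    cache.compressed = True


def prefill(cache: LayerCache, prompt: List[int]) -> None:
    # Prefill only fills the cache; nothing is discarded here.
    cache.entries.extend(prompt)


def decode_step(cache: LayerCache, token: int) -> None:
    # The discard runs once, on the first decoding step; later steps skip it.
    if not cache.compressed:
        compress(cache)
    cache.entries.append(token)


cache = LayerCache()
prefill(cache, list(range(100)))
sizes = [len(cache.entries)]
for t in range(100, 103):
    decode_step(cache, t)
    sizes.append(len(cache.entries))
print(sizes)
```

In this sketch the cache holds 100 entries after prefill, shrinks once at the first decoding step, and then only grows by one entry per generated token.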
In section A.2 of the paper, it states that SimLayerKV will be used in both the prefill and decode stages. However, in the code, I see:
During the prefill stage, `bos_weight` is set to 0. In the compression process at this line, compression is not performed. I would like to ask whether compression is only carried out in the decode stage, or whether there is an issue with the code implementation.

In this line, the condition is true only when decoding the first token, allowing for subsequent compression. However, during the later decoding process, the second condition `torch.tensor(bos_weights).sum() != 0` causes this check to be false, preventing further compression. Could you please clarify this situation?

Thank you for your response.