Hello,
I noticed that the effective sliding window size may differ between the prefill stage and the decode stage. In the prefill stage, the current token is visible along with the most recent `sliding_window_size` tokens (code here). In the decode stage, however, the current token is only visible along with the most recent `sliding_window_size - 1` tokens. I'm wondering what the purpose of this distinction is, i.e. why the code is written one way instead of the other.
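To make the distinction I mean concrete, here is a minimal NumPy sketch of the two behaviors as I read them. The function names and the off-by-one handling are my own illustration of what I observed, not the repository's actual implementation:

```python
import numpy as np

def prefill_mask(seq_len: int, window: int) -> np.ndarray:
    # Prefill (as I read the code): token i attends to itself plus the
    # previous `window` tokens, i.e. window + 1 visible positions in total.
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo = max(0, i - window)
        mask[i, lo:i + 1] = True
    return mask

def decode_visible(cache_len: int, window: int) -> list[int]:
    # Decode (as I read the code): the new token at position `cache_len`
    # attends to itself plus only the last `window - 1` cached tokens,
    # i.e. `window` visible positions in total.
    start = max(0, cache_len - (window - 1))
    return list(range(start, cache_len)) + [cache_len]
```

With `window = 3` and enough context, a prefill row has 4 visible positions while a decode step has only 3, which is exactly the mismatch I'm asking about.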
And by the way, could you please tell me whether SWA was used during training? Thanks.