Hello,
I noticed that the effective sliding window size may differ between the prefill stage and the decode stage. In the prefill stage, the current token is visible along with the most recent `sliding_window_size` tokens (code here). In the decode stage, however, the current token is only visible along with the most recent `sliding_window_size - 1` tokens. I'm wondering what the purpose of this distinction is, i.e. why the code is written one way instead of the other.
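To make the distinction I mean concrete, here is a minimal NumPy sketch of the two behaviors as I read them. The function names and the off-by-one handling are my own illustration of what I observed, not the repository's actual implementation:

```python
import numpy as np

def prefill_mask(seq_len: int, window: int) -> np.ndarray:
    # Prefill (as I read the code): token i attends to itself plus the
    # previous `window` tokens, i.e. window + 1 visible positions in total.
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo = max(0, i - window)
        mask[i, lo:i + 1] = True
    return mask

def decode_visible(cache_len: int, window: int) -> list[int]:
    # Decode (as I read the code): the new token at position `cache_len`
    # attends to itself plus only the last `window - 1` cached tokens,
    # i.e. `window` visible positions in total.
    start = max(0, cache_len - (window - 1))
    return list(range(start, cache_len)) + [cache_len]
```

With `window = 3` and enough context, a prefill row has 4 visible positions while a decode step has only 3, which is exactly the mismatch I'm asking about.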
And by the way, could you please tell me whether SWA was used during training? Thanks.