mistralai / mistral-inference

Official inference library for Mistral models
https://mistral.ai/
Apache License 2.0

sliding window size in prefill and decode stage #60

Open ofhwei opened 11 months ago

ofhwei commented 11 months ago

Hello,

I noticed that the effective sliding window size differs between the prefill stage and the decode stage. In the prefill stage, the current token is visible along with the most recent sliding_window tokens (code here), i.e. sliding_window + 1 tokens in total. In the decode stage, however, the current token is only visible with the most recent sliding_window - 1 tokens. I'm wondering what the purpose of this distinction is, i.e. why the code is

mask = torch.triu(mask, diagonal=-self.args.sliding_window)

instead of

mask = torch.triu(mask, diagonal=-self.args.sliding_window + 1)
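To make the difference concrete, here is a small sketch (my own illustration, not the library's code) that builds a causal mask of ones and then clips it to a band with `torch.triu`, the way the prefill mask is constructed. The `band_mask` helper and the toy sizes are assumptions for demonstration; only the `diagonal` argument mirrors the two variants above.

```python
import torch

def band_mask(seqlen: int, diagonal: int) -> torch.Tensor:
    # Causal mask of ones (key j visible to query i iff j <= i),
    # then clipped to a band: additionally require j >= i + diagonal.
    tensor = torch.full((seqlen, seqlen), fill_value=1.0)
    mask = torch.tril(tensor, diagonal=0)
    return torch.triu(mask, diagonal=diagonal)

W = 3  # toy sliding_window

# diagonal = -W: each query sees itself plus the W previous tokens.
m_prefill = band_mask(6, diagonal=-W)
# diagonal = -W + 1: each query sees itself plus only W - 1 previous tokens,
# matching what a rotating cache of size W provides at decode time.
m_tight = band_mask(6, diagonal=-W + 1)

print(int(m_prefill[-1].sum()))  # 4  (= W + 1 visible tokens)
print(int(m_tight[-1].sum()))    # 3  (= W visible tokens)
```

Counting the ones in the last row shows that `diagonal=-sliding_window` lets each query attend to one more token during prefill than the decode-time cache of size `sliding_window` can ever supply, which is exactly the asymmetry being asked about.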

And by the way, could you please tell me if SWA was used during training? Thanks.