I have been fine-tuning Mistral-7B-Instruct-v0.2 recently, and I noticed that when I disable SWA and train with a sequence length of 32K, the initial loss is unusually high (6.0). However, when I train with a sequence length of 4096, the loss is normal (1.5). This leads me to suspect that Mistral-7B-Instruct-v0.2 may actually have been trained with a sliding window of 4096, rather than without SWA as officially stated.