I have been fine-tuning Mistral-7B-Instruct-v0.2 recently, and I noticed that when I disable SWA and train with a sequence length of 32K, the initial loss is unusually high (6.0). However, when I train with a sequence length of 4096, the loss is normal (1.5). This leads me to suspect that Mistral-7B-Instruct-v0.2 may actually have been trained with a sliding window of 4096, rather than without SWA as officially stated.