Yeah, I would be very interested in seeing the results on some of the newer models:
qwen-1.5 and qwen-2 models (the 1.5 models actually seem better for creative writing, as also found in EQ-Bench).
DeepSeek-V2 and DeepSeek-Coder-V2 claim 128k context, but they use YaRN scaling on top of a 4k base model (see the config sketch below).
Codestral-22B-v0.1 claims 32k, while Mixtral-8x22B-Instruct (which people think was initialized from the same unreleased 22b base model) claims 64k but drops off quickly in the RULER tests.
WizardLM-2-8x22B seems amazing for code (see here), and even though it was trained from the same Mixtral-8x22B base model as Mixtral-8x22B-Instruct, it seems very different. It would be interesting to see whether it has lost or gained any context length.
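For the DeepSeek point, one quick way to sanity-check where the 128k figure comes from is to look at the rope_scaling block in each repo's config.json. Something like the sketch below (untested, and assuming the repos keep the standard Hugging Face config layout) should show the YaRN entry sitting on top of the much shorter original training length:

```python
# Rough sketch: check whether a model's advertised context comes from RoPE scaling
# (e.g. YaRN over a much shorter pretraining length) rather than native long-context training.
# Assumes the repos expose a standard Hugging Face config.json; field names may vary per model.
from transformers import AutoConfig

for repo in ["deepseek-ai/DeepSeek-V2", "deepseek-ai/DeepSeek-Coder-V2-Instruct"]:
    cfg = AutoConfig.from_pretrained(repo, trust_remote_code=True)
    scaling = getattr(cfg, "rope_scaling", None)  # e.g. {"type": "yarn", "original_max_position_embeddings": 4096, ...}
    print(repo, getattr(cfg, "max_position_embeddings", None), scaling)
```

The same check would apply to Codestral and the Mixtral-derived models, since their claimed lengths differ even when the suspected base model is shared.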