Yeah, I would be very interested in seeing the results on some of the newer models:
qwen-1.5 and qwen-2 models (the 1.5 models actually seem better for creative writing, as also found in EQ-Bench).
DeepSeek-V2 and DeepSeek-Coder-V2 claim 128k context, but they use YaRN scaling on top of a 4k base model (see the config sketch below).
Codestral-22B-v0.1 claims 32k, while Mixtral-8x22B-Instruct (which people think was initialized from the same unreleased 22b base model) claims 64k but drops off quickly in the RULER tests.
WizardLM-2-8x22B seems amazing for code (see here), and even though it was trained from the same Mixtral-8x22B base model as Mixtral-8x22B-Instruct, it seems very different. It would be interesting to see whether it has lost or gained any context length.
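For the DeepSeek point, one quick way to sanity-check where the 128k figure comes from is to look at the rope_scaling block in each repo's config.json. Something like the sketch below (untested, and assuming the repos keep the standard Hugging Face config layout) should show the YaRN entry sitting on top of the much shorter original training length:

```python
# Rough sketch: check whether a model's advertised context comes from RoPE scaling
# (e.g. YaRN over a much shorter pretraining length) rather than native long-context training.
# Assumes the repos expose a standard Hugging Face config.json; field names may vary per model.
from transformers import AutoConfig

for repo in ["deepseek-ai/DeepSeek-V2", "deepseek-ai/DeepSeek-Coder-V2-Instruct"]:
    cfg = AutoConfig.from_pretrained(repo, trust_remote_code=True)
    scaling = getattr(cfg, "rope_scaling", None)  # e.g. {"type": "yarn", "original_max_position_embeddings": 4096, ...}
    print(repo, getattr(cfg, "max_position_embeddings", None), scaling)
```

The same check would apply to Codestral and the Mixtral-derived models, since their claimed lengths differ even when the suspected base model is shared.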