A reviewer requested a comparison of the DiVA methodology with Qwen, so I added quick-and-dirty support for their HF checkpoints, reusing as much of the Llama code as possible!
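For context, here's a minimal sketch of the general "reuse Llama" approach: since Qwen2 and Llama share rotary embeddings, RMSNorm, and a SwiGLU MLP, most of a Qwen2 config can be mapped onto a `LlamaConfig` so the existing Llama code paths apply. This is an illustrative sketch, not the actual diff; the helper name is hypothetical, and the main architectural delta it captures is Qwen2's QKV biases.

```python
from transformers import AutoConfig, LlamaConfig


def qwen_to_llama_config(qwen_name: str) -> LlamaConfig:
    """Hypothetical helper: map a Qwen2 HF config onto a LlamaConfig so
    Llama modules can be reused. Does NOT cover sliding-window attention."""
    qcfg = AutoConfig.from_pretrained(qwen_name)
    return LlamaConfig(
        vocab_size=qcfg.vocab_size,
        hidden_size=qcfg.hidden_size,
        intermediate_size=qcfg.intermediate_size,
        num_hidden_layers=qcfg.num_hidden_layers,
        num_attention_heads=qcfg.num_attention_heads,
        num_key_value_heads=qcfg.num_key_value_heads,
        max_position_embeddings=qcfg.max_position_embeddings,
        rms_norm_eps=qcfg.rms_norm_eps,
        rope_theta=qcfg.rope_theta,
        attention_bias=True,  # Qwen2 uses biases on the QKV projections; Llama doesn't
    )
```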
I didn't add support for their sliding-window attention mask since I'm not experimenting with long-context inputs, but I'm happy to close this and add it in a follow-up if that's a blocker for merging.
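For reference, the omitted piece is just a restriction of the causal mask to a local window. A minimal sketch of what that mask would look like (illustrative only, not what the checkpoints ship with):

```python
import torch


def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks allowed attention: causal (key <= query)
    and within the last `window` positions (query - key < window)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (i - j < window)


# e.g. sliding_window_causal_mask(6, 3) lets token 5 attend to tokens 3-5 only
```

At short sequence lengths (below the window size) this is identical to a plain causal mask, which is why skipping it is harmless for the non-long-context experiments here.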