qiyuxinlin opened 4 months ago
Thanks for your help! I have successfully run Qwen2MoE in the non-quantized version. Next, I need to study the code in your quantization part. There were some errors in my earlier comment: the reason the model produced no output was the bos_token_id and eos_token_id settings. The model I am running is Qwen1.5-MoE-A2.7B-Chat. I will try Qwen2-57B-A14B when I get a chance later. Thank you again.
ExLlama supports Qwen2, but Qwen2MoE is a different architecture. It will need its own definition in architecture.py with the appropriate flags set for bias etc.
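To give a rough idea of what that involves, here is a hypothetical sketch of the kind of per-architecture flags meant here (the names are illustrative, not the actual contents of architecture.py):

```python
# Hypothetical illustration only -- not the real exllamav2/architecture.py.
# Qwen2-family models put biases on the q/k/v projections but not on the
# output projection or the MLP, so the architecture entry has to say so:
if arch_string == "Qwen2MoeForCausalLM":
    self.attention_bias_qkv = True      # illustrative flag names
    self.attention_bias_o   = False
    self.mlp_bias           = False
```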
Main reason I haven't had time to add it yet is that the shared expert complicates quantization a lot. It's a regular MLP layer that runs parallel to the sparse MLP layer, and its output is gated and added to the residual stream. This is simple enough in PyTorch but requires some extra stuff to also work it into the measurement/quantization functions.
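In rough PyTorch terms it looks something like this (an illustrative sketch, not ExLlama code; names loosely follow the HF Qwen2MoE implementation, and the dims are toy-sized):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLP(nn.Module):
    """Standard SiLU-gated MLP, as used for both sparse and shared experts."""
    def __init__(self, hidden, intermediate):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj   = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class SparseMoeWithSharedExpert(nn.Module):
    def __init__(self, hidden, intermediate, shared_intermediate, n_experts, top_k):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(GatedMLP(hidden, intermediate)
                                     for _ in range(n_experts))
        self.shared_expert = GatedMLP(hidden, shared_intermediate)
        self.shared_expert_gate = nn.Linear(hidden, 1, bias=False)

    def forward(self, x):                       # x: (num_tokens, hidden)
        # Sparse path: route each token to its top-k experts.
        weights = F.softmax(self.router(x), dim=-1)
        weights, idx = torch.topk(weights, self.top_k, dim=-1)
        # (Whether the top-k weights get renormalized is a config option in HF.)
        sparse_out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel():
                w = weights[rows, slots].unsqueeze(-1)
                sparse_out[rows] += w * expert(x[rows])
        # Shared path: a regular dense MLP whose output is sigmoid-gated...
        shared_out = torch.sigmoid(self.shared_expert_gate(x)) * self.shared_expert(x)
        # ...and added alongside the sparse output into the residual stream.
        return sparse_out + shared_out
```

It's that shared path that the measurement and quantization passes currently have no notion of, since every other supported architecture has a single MLP (dense or sparse) per layer.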
As for prompt formatting, that isn't really the responsibility of the model implementation. If you're working with an instruct-tuned model it should of course prefer a correctly formatted prompt, which for Qwen2 (since it's ChatML) would look like:
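```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```

(The system and user text are just placeholders; the `<|im_start|>`/`<|im_end|>` template tokens are what matter.)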
But I've yet to see an instruct model that couldn't do a regular completion from a naked prompt like `Once upon a time,`.