qiyuxinlin opened 4 months ago
Thanks for your help! I have successfully run Qwen2MoE in the non-quantized version. Next, I need to study the code in your quantization part. There were some errors in my earlier comment: the reason the model produced no output was the bos_token_id and eos_token_id settings. The model I am running is Qwen1.5-MoE-A2.7B-Chat. I will try Qwen2-57B-A14B when I get a chance later. Thank you again.
ExLlama supports Qwen2, but Qwen2MoE is a different architecture. It will need its own definition in architecture.py with the appropriate flags set for bias etc.
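To give a rough idea of what that involves, here is a hypothetical sketch of the kind of per-architecture flags meant here (the names are illustrative, not the actual contents of architecture.py):

```python
# Hypothetical illustration only -- not the real exllamav2/architecture.py.
# Qwen2-family models put biases on the q/k/v projections but not on the
# output projection or the MLP, so the architecture entry has to say so:
if arch_string == "Qwen2MoeForCausalLM":
    self.attention_bias_qkv = True      # illustrative flag names
    self.attention_bias_o   = False
    self.mlp_bias           = False
```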
Main reason I haven't had time to add it yet is that the shared expert complicates quantization a lot. It's a regular MLP layer that runs parallel to the sparse MLP layer, and its output is gated and added to the residual stream. This is simple enough in PyTorch but requires some extra stuff to also work it into the measurement/quantization functions.
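In rough PyTorch terms it looks something like this (an illustrative sketch, not ExLlama code; names loosely follow the HF Qwen2MoE implementation, and the dims are toy-sized):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLP(nn.Module):
    """Standard SiLU-gated MLP, as used for both sparse and shared experts."""
    def __init__(self, hidden, intermediate):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj   = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class SparseMoeWithSharedExpert(nn.Module):
    def __init__(self, hidden, intermediate, shared_intermediate, n_experts, top_k):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(GatedMLP(hidden, intermediate)
                                     for _ in range(n_experts))
        self.shared_expert = GatedMLP(hidden, shared_intermediate)
        self.shared_expert_gate = nn.Linear(hidden, 1, bias=False)

    def forward(self, x):                       # x: (num_tokens, hidden)
        # Sparse path: route each token to its top-k experts.
        weights = F.softmax(self.router(x), dim=-1)
        weights, idx = torch.topk(weights, self.top_k, dim=-1)
        # (Whether the top-k weights get renormalized is a config option in HF.)
        sparse_out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel():
                w = weights[rows, slots].unsqueeze(-1)
                sparse_out[rows] += w * expert(x[rows])
        # Shared path: a regular dense MLP whose output is sigmoid-gated...
        shared_out = torch.sigmoid(self.shared_expert_gate(x)) * self.shared_expert(x)
        # ...and added alongside the sparse output into the residual stream.
        return sparse_out + shared_out
```

It's that shared path that the measurement and quantization passes currently have no notion of, since every other supported architecture has a single MLP (dense or sparse) per layer.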
As for prompt formatting, that isn't really the responsibility of the model implementation. If you're working with an instruct-tuned model it should of course prefer a correctly formatted prompt, which for Qwen2 (since it's ChatML) would look like:
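```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```

(The system and user text are just placeholders; the `<|im_start|>`/`<|im_end|>` template tokens are what matter.)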
But I've yet to see an instruct model that couldn't do a regular completion from a naked prompt like `Once upon a time,`.