awni closed this 2 days ago
Kind of works in 2-bit, but the output is all Chinese for some reason:
mlx_lm.convert --hf-path tencent-community/Hunyuan-A52B-Instruct -q --q-bits 2 --q-group-size 32
mlx_lm.generate --model mlx_model --prompt "Write a story about Einstein" -m 100 --trust-remote-code
Outputs:
==========
Prompt: <|startoftext|><|startoftext|>Write a story about Einstein<|extra_4|><|extra_0|>
<|startoftext|>给出的问题是关于爱因斯坦的故事。以下是一个关于爱因斯坦的虚构故事:
在一个风和日丽的午后,阿尔伯特·爱因斯坦正坐在他位于柏林的办公室里,埋头于一份关于光速不变原理的论文。爱因因斯坦是一个著名的理论物理学家,他的相对论已经改变了科学界对时间和空间的理解。
突然,爱因斯坦的助手急匆匆地闯进了办公室,手里还拿着一封来自国际物理研究协会的信
==========
(Rough translation of the Chinese output: "The question asks for a story about Einstein. Here is a fictional story about Einstein: On a sunny afternoon, Albert Einstein was sitting in his office in Berlin, buried in a paper on the constancy of the speed of light. Einstein was a famous theoretical physicist whose theory of relativity had already changed how science understands time and space. Suddenly, Einstein's assistant rushed into the office, holding a letter from the International Physics Research Association...")
I tried it in 3-bit as well:
mlx_lm.convert --hf-path tencent-community/Hunyuan-A52B-Instruct -q --q-bits 3 --q-group-size 32
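For scripting the conversion instead of using the CLI, the Python API should be equivalent. A minimal sketch, assuming mlx_lm.convert takes keyword arguments mirroring the CLI flags (names may differ slightly between mlx_lm versions):

# Minimal sketch: programmatic equivalent of the mlx_lm.convert CLI call above.
# Assumes convert() takes keyword arguments mirroring the CLI flags.
from mlx_lm import convert

convert(
    hf_path="tencent-community/Hunyuan-A52B-Instruct",
    mlx_path="mlx_model",   # output directory, same default as the CLI
    quantize=True,          # -q
    q_bits=3,               # --q-bits 3
    q_group_size=32,        # --q-group-size 32
)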
Works nicely on an M2 Ultra:
mlx_lm.generate --model mlx-community/Hunyuan-A52B-Instruct-3bit --prompt "Write a story about Einstein in English." -m 128 --system-prompt "You are a helpful AI assistant" --eos-token "<|eos|>"
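The same run can also be scripted from Python. A minimal sketch, assuming the tokenizer ships a chat template; the prompt formatting for Hunyuan and the EOS handling are assumptions, and the CLI's --eos-token override is not reproduced here:

# Minimal sketch: load the 3-bit model and generate from Python.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Hunyuan-A52B-Instruct-3bit")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "Write a story about Einstein in English."},
]
# Assumes the tokenizer provides a chat template; the CLI's --eos-token
# "<|eos|>" override is not reproduced here and may be needed to stop cleanly.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)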
I think this is mergeable. The quality is pretty good, so I don't think there is a bug. I suspect the default Chinese output is due to the quantization (3-bit is quite a lot of compression, after all), but it's difficult to test anything higher. We can try GPTQ or a distributed 8-bit run to see if either helps improve that.
Currently it's too large to run in 192 GB even in 4-bit. Looking into mixed precision.
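One way to attempt the mixed-precision route: keep the most sensitive layers (e.g. the MoE gate/router and embeddings) at higher precision and quantize the rest at 3-bit. A minimal sketch, assuming the installed mlx_lm.convert accepts a quant_predicate callable with a (path, module, config) signature that can return per-layer bit/group-size overrides; the argument name, callback signature, and layer-name matching below are all assumptions, not a tested recipe:

# Minimal sketch of mixed-precision quantization via a per-layer predicate.
# Assumes mlx_lm.convert accepts quant_predicate(path, module, config) and
# that returning a dict overrides bits/group size for that layer.
from mlx_lm import convert

def keep_sensitive_layers(path, module, config):
    # Heuristic: hold MoE routers and embeddings at 6-bit; everything else
    # gets the default 3-bit / group size 32 set below. Untested recipe.
    if "gate" in path or "embed" in path:
        return {"bits": 6, "group_size": 64}
    return True

convert(
    hf_path="tencent-community/Hunyuan-A52B-Instruct",
    mlx_path="mlx_model_mixed",
    quantize=True,
    q_bits=3,
    q_group_size=32,
    quant_predicate=keep_sensitive_layers,
)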