mlc-ai / binary-mlc-llm-libs


Add support for quantized qwen2-0.5b #128

Closed by bil-ash 3 weeks ago

bil-ash commented 3 weeks ago

Add support for quantized (q4f16) qwen2-0.5b. The Wasm library is taken from https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot/resolve/main/Qwen2-0.5B-Instruct-q4f16_1-webgpu.wasm?download=true
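For context, a minimal sketch of how a model record pointing at such a wasm library might be consumed from WebLLM. The field names (`model`, `model_id`, `model_lib`) follow recent versions of `@mlc-ai/web-llm` and may differ in older ones; the `model_id` chosen here is illustrative, and the URLs are just the ones from this PR:

```ts
import { CreateMLCEngine, AppConfig } from "@mlc-ai/web-llm";

// Hypothetical app config registering the quantized Qwen2-0.5B model.
const appConfig: AppConfig = {
  model_list: [
    {
      // Repo holding the quantized weights
      model: "https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot",
      // Illustrative identifier used to select the model below
      model_id: "Qwen2-0.5B-Instruct-q4f16_1",
      // The compiled WebGPU wasm library from this PR
      model_lib:
        "https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot/resolve/main/Qwen2-0.5B-Instruct-q4f16_1-webgpu.wasm",
    },
  ],
};

// Load the model and run it in the browser via WebGPU
const engine = await CreateMLCEngine("Qwen2-0.5B-Instruct-q4f16_1", { appConfig });
```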

Neet-Nestor commented 3 weeks ago

Related PRs:

bil-ash commented 3 weeks ago

I have renamed it as suggested. By the way, what is the prefill chunk size, and how does it relate to memory usage and performance?

CharlieFRuan commented 3 weeks ago

Thanks! Say the prefill chunk size is 2k: if a prompt is 4k tokens, it will be prefilled in two passes instead of all at once. This bounds the size of the intermediate buffers for the matrix multiplications, lowering peak memory usage at the cost of some extra per-chunk overhead during prefill.
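For intuition, a minimal sketch of the chunking loop; the names here (`chunkedPrefill`, `forwardChunk`) are illustrative, not MLC's actual internals:

```ts
// Sketch of chunked prefill: process the prompt in slices of at most
// `prefillChunkSize` tokens so intermediate activation buffers stay bounded.
async function chunkedPrefill(
  tokens: number[],
  prefillChunkSize: number,
  // Hypothetical callback that runs one forward pass over a chunk,
  // appending its keys/values to the KV cache as a side effect.
  forwardChunk: (chunk: number[]) => Promise<void>,
): Promise<void> {
  for (let start = 0; start < tokens.length; start += prefillChunkSize) {
    const chunk = tokens.slice(start, start + prefillChunkSize);
    // Each pass only allocates buffers sized for `chunk.length` tokens;
    // the KV cache accumulated so far lets later chunks attend to earlier ones.
    await forwardChunk(chunk);
  }
}
```

With a 4k-token prompt and a 2k chunk size, the loop above runs twice, matching the two-pass behavior described in the comment.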