mlc-ai / binary-mlc-llm-libs


Add support for quantized qwen2-0.5b #128

Closed by bil-ash 3 weeks ago

bil-ash commented 3 weeks ago

Add support for quantized (q4f16) qwen2-0.5b. The Wasm library is taken from https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot/resolve/main/Qwen2-0.5B-Instruct-q4f16_1-webgpu.wasm?download=true
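For context, a minimal sketch of how a model record pointing at such a wasm library might be consumed from WebLLM. The field names (`model`, `model_id`, `model_lib`) follow recent versions of `@mlc-ai/web-llm` and may differ in older ones; the `model_id` chosen here is illustrative, and the URLs are just the ones from this PR:

```ts
import { CreateMLCEngine, AppConfig } from "@mlc-ai/web-llm";

// Hypothetical app config registering the quantized Qwen2-0.5B model.
const appConfig: AppConfig = {
  model_list: [
    {
      // Repo holding the quantized weights
      model: "https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot",
      // Illustrative identifier used to select the model below
      model_id: "Qwen2-0.5B-Instruct-q4f16_1",
      // The compiled WebGPU wasm library from this PR
      model_lib:
        "https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot/resolve/main/Qwen2-0.5B-Instruct-q4f16_1-webgpu.wasm",
    },
  ],
};

// Load the model and run it in the browser via WebGPU
const engine = await CreateMLCEngine("Qwen2-0.5B-Instruct-q4f16_1", { appConfig });
```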

Neet-Nestor commented 3 weeks ago

Related PRs:

bil-ash commented 3 weeks ago

I have renamed it as suggested. By the way, what is the prefill chunk size, and how does it relate to memory usage and performance?

CharlieFRuan commented 3 weeks ago

Thanks! Say the prefill chunk size is 2k: if a prompt is 4k tokens, it will be prefilled in two passes instead of all at once. This bounds the size of the intermediate buffers for the matrix multiplications, lowering peak memory usage at the cost of some extra per-chunk overhead during prefill.
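For intuition, a minimal sketch of the chunking loop; the names here (`chunkedPrefill`, `forwardChunk`) are illustrative, not MLC's actual internals:

```ts
// Sketch of chunked prefill: process the prompt in slices of at most
// `prefillChunkSize` tokens so intermediate activation buffers stay bounded.
async function chunkedPrefill(
  tokens: number[],
  prefillChunkSize: number,
  // Hypothetical callback that runs one forward pass over a chunk,
  // appending its keys/values to the KV cache as a side effect.
  forwardChunk: (chunk: number[]) => Promise<void>,
): Promise<void> {
  for (let start = 0; start < tokens.length; start += prefillChunkSize) {
    const chunk = tokens.slice(start, start + prefillChunkSize);
    // Each pass only allocates buffers sized for `chunk.length` tokens;
    // the KV cache accumulated so far lets later chunks attend to earlier ones.
    await forwardChunk(chunk);
  }
}
```

With a 4k-token prompt and a 2k chunk size, the loop above runs twice, matching the two-pass behavior described in the comment.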