@jparismorgan Thanks for bringing it up. Yes, this is a known issue that we noticed a short while ago. The fix is not trivial and we are working on it. Sorry for not catching the issue at the time of release.
Just noticed that the prebuilt Llama2 q4f32_1 lib in https://webllm.mlc.ai actually works fine.
Are you using the lib you just built? If so, you might need to wait for us to ship a fix, or you can use the prebuilt q4f32_1 wasm in https://github.com/mlc-ai/binary-mlc-llm-libs.
Hi, this issue should now be addressed. If you update the mlc-ai pip package, the issue should be gone.
🐛 Bug
When running the model produced by
python3 -m mlc_llm.build --hf-path meta-llama/Llama-2-7b-chat-hf --target webgpu --quantization q4f32_1
on the web, I get: Init error, GPUPipelineError: The total number of workgroup invocations (512) exceeds the maximum allowed (256).
To Reproduce
Steps to reproduce the behavior:
First I have to modify fuse_split_rotary_embedding.py as specified here: https://github.com/mlc-ai/mlc-llm/issues/816#issuecomment-1694558023 - I just replace all instances of float16 with float32 in fuse_split_rotary_embedding.py (see the sketch below).
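A minimal sketch of that edit, assuming the file lives under mlc_llm/transform/ in the mlc-llm checkout (the exact path may differ between versions):

```bash
# Replace every float16 literal with float32 in the fusion pass.
# The path below is an assumption; locate the file first if needed:
#   find . -name fuse_split_rotary_embedding.py
sed -i 's/float16/float32/g' mlc_llm/transform/fuse_split_rotary_embedding.py
```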
I then compile llama2:
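This is the same build command as in the bug description above:

```bash
python3 -m mlc_llm.build --hf-path meta-llama/Llama-2-7b-chat-hf --target webgpu --quantization q4f32_1
```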
Then I update examples/simple-chat/src/mlc-local-config.js to point at my locally built model (a rough sketch of such a config is shown below).
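A hypothetical sketch of what that config might look like; the field names, model id, URLs, and ports are assumptions and may differ depending on the web-llm version:

```javascript
// examples/simple-chat/src/mlc-local-config.js (hypothetical sketch)
// Points the simple-chat app at the locally built Llama-2 q4f32_1 artifacts.
export default {
  "model_list": [
    {
      // Served by the local static file server (URL/port are assumptions)
      "model_url": "http://localhost:8000/Llama-2-7b-chat-hf-q4f32_1/params/",
      "local_id": "Llama-2-7b-chat-hf-q4f32_1"
    }
  ],
  "model_lib_map": {
    // The wasm produced by mlc_llm.build for the webgpu target
    "Llama-2-7b-chat-hf-q4f32_1":
      "http://localhost:8000/Llama-2-7b-chat-hf-q4f32_1/Llama-2-7b-chat-hf-q4f32_1-webgpu.wasm"
  },
  "use_web_worker": true
};
```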
Then I start the local server that serves the build artifacts:
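The exact command is not shown in the report; one common way to serve the built artifacts locally is a plain static file server (an assumption, not necessarily what was used here):

```bash
# Serve the mlc-llm dist/ directory over HTTP so the browser can fetch
# the params and the wasm (directory and port are assumptions).
cd dist && python3 -m http.server 8000
```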
And then start the example app:
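Again the report omits the exact command; for the simple-chat example this is typically something like the following (assumed, based on the parcel dev-server output below):

```bash
# Build and run the simple-chat example app with its dev server.
cd examples/simple-chat
npm install
npm start
```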
Server running at http://localhost:8888 ✨ Built in 174ms
Init error, GPUPipelineError: The total number of workgroup invocations (512) exceeds the maximum allowed (256).