mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] Running Llama 2 with WebGPU `q4f32_1` gives `GPUPipelineError: The total number of workgroup invocations (512) exceeds the maximum allowed (256).` #838

Closed jparismorgan closed 1 year ago

jparismorgan commented 1 year ago

## 🐛 Bug

When running the model produced by `python3 -m mlc_llm.build --hf-path meta-llama/Llama-2-7b-chat-hf --target webgpu --quantization q4f32_1` in the browser, I get `Init error, GPUPipelineError: The total number of workgroup invocations (512) exceeds the maximum allowed (256).`

## To Reproduce

Steps to reproduce the behavior:

  1. First I have to modify `fuse_split_rotary_embedding.py` as described in https://github.com/mlc-ai/mlc-llm/issues/816#issuecomment-1694558023: I just replace every instance of `float16` with `float32` in `fuse_split_rotary_embedding.py`. (A small helper that automates this substitution is sketched after these steps.)

  2. I then compile Llama 2:

    (mlc-llm) ~/repo/mlc-llm python3 -m mlc_llm.build --hf-path meta-llama/Llama-2-7b-chat-hf --target webgpu --quantization q4f32_1
    Weights exist at dist/models/Llama-2-7b-chat-hf, skipping download.
    Using path "dist/models/Llama-2-7b-chat-hf" for model "Llama-2-7b-chat-hf"
    Target configured: webgpu -keys=webgpu,gpu -max_num_threads=256
    Load cached module from dist/Llama-2-7b-chat-hf-q4f32_1/mod_cache_before_build.pkl and skip tracing. You can use --use-cache=0 to retrace
    [14:08:22] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/tvm/src/target/llvm/codegen_llvm.cc:185: Warning: Set native vector bits to be 128 for wasm32
    Finish exporting to dist/Llama-2-7b-chat-hf-q4f32_1/Llama-2-7b-chat-hf-q4f32_1-webgpu.wasm
  3. Then I update `examples/simple-chat/src/mlc-local-config.js` to:

    // config used when serving from local mlc-llm/dist
    // use web-llm/script/serve_mlc_llm_dist.sh to start the artifact server
    export default {
      "model_list": [
        {
          "model_url": "http://localhost:8000/Llama-2-7b-chat-hf-q4f32_1/params/",
          "local_id": "Llama-2-7b-chat-hf-q4f32_1"
        },
      ],
      "model_lib_map": {
        "Llama-2-7b-chat-hf-q4f32_1": "http://localhost:8000/Llama-2-7b-chat-hf-q4f32_1/Llama-2-7b-chat-hf-q4f32_1-webgpu.wasm",
      },
      "use_web_worker": true
    }
  4. Then I start the local artifact server:

    ~/repo/web-llm ./scripts/serve_mlc_llm_dist.sh
  5. And then start the example app:

    
    ~/repo/web-llm/examples/simple-chat npm run mlc-local

    > simple-chat@0.1.0 mlc-local
    > cp src/mlc-local-config.js src/app-config.js && parcel src/llm_chat.html --port 8888

    Server running at http://localhost:8888
    ✨ Built in 174ms
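
As referenced in step 1, the `float16` → `float32` substitution can also be scripted. Below is a minimal Node sketch; the path to `fuse_split_rotary_embedding.py` is an assumption, so point it at wherever the file sits in your mlc-llm checkout.

    // One-off helper for step 1: replace every "float16" with "float32" in the
    // rotary-embedding fusion pass. The path is an assumption; adjust it to your
    // mlc-llm checkout. Run with: node patch_fuse_split_rotary.mjs
    import { readFileSync, writeFileSync } from "node:fs";

    const path = "mlc_llm/transform/fuse_split_rotary_embedding.py";
    const source = readFileSync(path, "utf8");
    const patched = source.replaceAll("float16", "float32");
    writeFileSync(path, patched);
    console.log(`patched ${source.split("float16").length - 1} occurrence(s) of float16`);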

And when I open the app in the browser, I get:

<img width="1576" alt="Screenshot 2023-08-30 at 3 39 34 PM" src="https://github.com/mlc-ai/mlc-llm/assets/1396242/ac73add3-c8b1-4507-9616-3e4732090fed">

With this error:

Init error, GPUPipelineError: The total number of workgroup invocations (512) exceeds the maximum allowed (256).
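
For reference, the 256 in this message is the WebGPU default for `maxComputeInvocationsPerWorkgroup`: unless a device is created with higher `requiredLimits`, any compute shader whose workgroup has more than 256 total invocations (here 512) fails pipeline creation. A quick way to see what the adapter supports versus what a default device allows is to paste the following into the browser console on a WebGPU-enabled page (a sketch, not MLC-LLM code):

    // Compare what the GPU adapter supports with what a device gets by default.
    const adapter = await navigator.gpu.requestAdapter();
    console.log("adapter supports:", adapter.limits.maxComputeInvocationsPerWorkgroup);

    // No requiredLimits requested, so the device is capped at the spec default (256).
    const device = await adapter.requestDevice();
    console.log("device allows:", device.limits.maxComputeInvocationsPerWorkgroup);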



## Expected behavior

I can run Llama 2 on the web.

## Environment

 - Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): WebGPU
 - Operating system (e.g. Ubuntu/Windows/MacOS/...): MacOS
 - Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): Chrome Browser
 - How you installed MLC-LLM (`conda`, source): source
 - How you installed TVM-Unity (`pip`, source): pip
 - Python version (e.g. 3.10): 3.11.5
 - GPU driver version (if applicable):
 - CUDA/cuDNN version (if applicable):
 - TVM Unity Hash Tag (`python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"`, applicable if you compile models):
USE_GTEST: AUTO
SUMMARIZE: OFF
USE_IOS_RPC: OFF
USE_ETHOSU: 
CUDA_VERSION: NOT-FOUND
USE_LIBBACKTRACE: AUTO
DLPACK_PATH: 3rdparty/dlpack/include
USE_TENSORRT_CODEGEN: OFF
USE_THRUST: OFF
USE_TARGET_ONNX: OFF
USE_AOT_EXECUTOR: ON
BUILD_DUMMY_LIBTVM: OFF
USE_CUDNN: OFF
USE_TENSORRT_RUNTIME: OFF
USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
USE_CCACHE: AUTO
USE_ARM_COMPUTE_LIB: OFF
USE_CPP_RTVM: 
USE_OPENCL_GTEST: /path/to/opencl/gtest
USE_MKL: OFF
USE_PT_TVMDSOOP: OFF
USE_CLML: OFF
USE_STACKVM_RUNTIME: OFF
USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
ROCM_PATH: /opt/rocm
USE_DNNL: OFF
USE_VITIS_AI: OFF
USE_LLVM: llvm-config --link-static
USE_VERILATOR: OFF
USE_TF_TVMDSOOP: OFF
USE_THREADS: ON
USE_MSVC_MT: OFF
BACKTRACE_ON_SEGFAULT: OFF
USE_GRAPH_EXECUTOR: ON
USE_ROCBLAS: OFF
GIT_COMMIT_HASH: 2b204c39b53912814edc3f07e88919a5c76d00cf
USE_VULKAN: OFF
USE_RUST_EXT: OFF
USE_CUTLASS: OFF
USE_CPP_RPC: OFF
USE_HEXAGON: OFF
USE_CUSTOM_LOGGING: OFF
USE_UMA: OFF
USE_FALLBACK_STL_MAP: OFF
USE_SORT: ON
USE_RTTI: ON
GIT_COMMIT_TIME: 2023-08-08 17:21:25 -0400
USE_HEXAGON_SDK: /path/to/sdk
USE_BLAS: none
USE_ETHOSN: OFF
USE_LIBTORCH: OFF
USE_RANDOM: ON
USE_CUDA: OFF
USE_COREML: OFF
USE_AMX: OFF
BUILD_STATIC_RUNTIME: OFF
USE_CMSISNN: OFF
USE_KHRONOS_SPIRV: OFF
USE_CLML_GRAPH_EXECUTOR: OFF
USE_TFLITE: OFF
USE_HEXAGON_GTEST: /path/to/hexagon/gtest
PICOJSON_PATH: 3rdparty/picojson
USE_OPENCL_ENABLE_HOST_PTR: OFF
INSTALL_DEV: OFF
USE_PROFILER: ON
USE_NNPACK: OFF
LLVM_VERSION: 15.0.7
USE_OPENCL: OFF
COMPILER_RT_PATH: 3rdparty/compiler-rt
RANG_PATH: 3rdparty/rang/include
USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF
USE_OPENMP: OFF
USE_BNNS: OFF
USE_CUBLAS: OFF
USE_METAL: ON
USE_MICRO_STANDALONE_RUNTIME: OFF
USE_HEXAGON_EXTERNAL_LIBS: OFF
USE_ALTERNATIVE_LINKER: AUTO
USE_BYODT_POSIT: OFF
USE_HEXAGON_RPC: OFF
USE_MICRO: OFF
DMLC_PATH: 3rdparty/dmlc-core/include
INDEX_DEFAULT_I64: ON
USE_RELAY_DEBUG: OFF
USE_RPC: ON
USE_TENSORFLOW_PATH: none
TVM_CLML_VERSION: 
USE_MIOPEN: OFF
USE_ROCM: OFF
USE_PAPI: OFF
USE_CURAND: OFF
TVM_CXX_COMPILER_PATH: /Library/Developer/CommandLineTools/usr/bin/c++
HIDE_PRIVATE_SYMBOLS: ON

## Additional context

Thank you!
MasterJH5574 commented 1 year ago

@jparismorgan Thanks for bringing it up. Yes, this is a known issue that we noticed a short while ago. The fix is not trivial and we are working on it. Sorry for not catching the issue at release time.
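
For anyone who wants to see the failure in isolation: it is a plain WebGPU validation error, independent of MLC-LLM. A minimal sketch (browser console, not MLC code), where an 8×8×8 workgroup gives the same 512-invocation total:

    // 8*8*8 = 512 invocations exceeds the default device limit of 256, so async
    // pipeline creation rejects with a GPUPipelineError like the one reported above.
    const adapter = await navigator.gpu.requestAdapter();
    const device = await adapter.requestDevice(); // default limits
    const module = device.createShaderModule({
      code: "@compute @workgroup_size(8, 8, 8) fn main() {}",
    });
    try {
      await device.createComputePipelineAsync({
        layout: "auto",
        compute: { module, entryPoint: "main" },
      });
    } catch (e) {
      console.log(e instanceof GPUPipelineError, e.message);
    }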

MasterJH5574 commented 1 year ago

Just noticed that the prebuilt Llama 2 q4f32_1 lib at https://webllm.mlc.ai actually works fine.

(screenshot showing the prebuilt Llama 2 q4f32_1 lib working in WebLLM)

Are you using the lib you just built? If so, you might need to wait for us to ship a fix, or you can use the prebuilt q4f32_1 wasm from https://github.com/mlc-ai/binary-mlc-llm-libs
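
If you go the prebuilt route, the change on the web-llm side should be pointing `model_lib_map` at the hosted wasm instead of the locally built one. A sketch of what `examples/simple-chat/src/mlc-local-config.js` might look like in that case; the raw URL and exact filename are assumptions, so verify them against the files in https://github.com/mlc-ai/binary-mlc-llm-libs:

    // Same config as before, but the library comes from binary-mlc-llm-libs.
    // The raw.githubusercontent.com URL and wasm filename below are assumptions;
    // check the repo for the exact name of the q4f32_1 webgpu library.
    export default {
      "model_list": [
        {
          "model_url": "http://localhost:8000/Llama-2-7b-chat-hf-q4f32_1/params/",
          "local_id": "Llama-2-7b-chat-hf-q4f32_1"
        },
      ],
      "model_lib_map": {
        "Llama-2-7b-chat-hf-q4f32_1":
          "https://raw.githubusercontent.com/mlc-ai/binary-mlc-llm-libs/main/Llama-2-7b-chat-hf-q4f32_1-webgpu.wasm",
      },
      "use_web_worker": true
    }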

MasterJH5574 commented 1 year ago

Hi, this issue should now be addressed. If you update the mlc-ai pip package, the error should be gone.