Device: MacBook Air M2; Arch Linux x86_64 with CUDA
How you installed MLC-LLM: python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
How you installed TVM-Unity (pip, source): python -m pip install --pre -U -f https://mlc.ai/wheels mlc-ai-nightly
Python version (e.g. 3.10): 3.11
TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models): f65b73221a83b2a94c383c5d1b0bfd6d75c69800
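For anyone trying to confirm this on another host, here is a minimal sketch that gathers the same fields in one go. The helper name `collect_env` is hypothetical, and it assumes the hash tag is stored under the `GIT_COMMIT_HASH` key of `tvm.support.libinfo()`; if that key is absent it falls back to `unknown`.

```python
import platform
import sys


def collect_env(libinfo=None):
    """Gather the environment fields listed above into a dict.

    `libinfo` is expected to be the mapping returned by
    tvm.support.libinfo(); when it is None we try to import tvm ourselves.
    """
    if libinfo is None:
        try:
            import tvm  # only importable when a TVM Unity wheel is installed

            libinfo = dict(tvm.support.libinfo())
        except ImportError:
            libinfo = {}
    return {
        "platform": platform.system(),
        "machine": platform.machine(),
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        # Assumption: GIT_COMMIT_HASH is the libinfo key holding the hash tag.
        "tvm_hash": libinfo.get("GIT_COMMIT_HASH", "unknown"),
    }
```

Calling `print(collect_env())` inside the environment where `mlc_llm` is installed should give something that can be pasted next to the fields above.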
🐛 Bug
RWKV v6 models fail to compile with the latest mlc_llm.
Edit: It also seems that CI currently only has a compile test for RWKV v5. Should an RWKV v6 compile test be added to CI as well?
To Reproduce
Steps to reproduce the behavior:
$ mlc_llm convert_weight rwkv-6-world-1b6 --quantization q4f16_1 -o rwkv-6-world-1b6-MLC
$ mlc_llm gen_config rwkv-6-world-1b6 --quantization q4f16_1 --conv-template rwkv_world -o rwkv-6-world-1b6-MLC
$ mlc_llm compile rwkv-6-world-1b6-MLC/mlc-chat-config.json --device metal --host arm64-apple-darwin -o rwkv-6-world-1b6-MLC/libs/rwkv-6-world-1b6-MLC-q4f16-metal.so
(or build with other hosts)

It gets the following error messages:
```
$ mlc_llm compile rwkv-6-world-1b6-MLC/mlc-chat-config.json --device metal --host arm64-apple-darwin -o rwkv-6-world-1b6-MLC/libs/rwkv-6-world-1b6-MLC-q4f16-metal.so
[2024-09-02 10:57:53] INFO auto_config.py:70: Found model configuration: rwkv-6-world-1b6-MLC/mlc-chat-config.json
[2024-09-02 10:57:54] INFO auto_device.py:79: Found device: metal:0
[2024-09-02 10:57:54] INFO auto_target.py:78: Found configuration of target device "metal:0": {"thread_warp_size": runtime.BoxInt(32), "max_threads_per_block": runtime.BoxInt(1024), "max_function_args": runtime.BoxInt(31), "max_num_threads": runtime.BoxInt(256), "kind": "metal", "max_shared_memory_per_block": runtime.BoxInt(32768), "tag": "", "keys": ["metal", "gpu"]}
[2024-09-02 10:57:54] INFO auto_target.py:114: Using LLVM triple specified by --host: arm64-apple-darwin
[2024-09-02 10:57:54] INFO auto_config.py:154: Found model type: rwkv6. Use `--model-type` to override.
Compiling with arguments:
  --config RWKV6Config(hidden_size=2048, intermediate_size=7168, num_hidden_layers=24, vocab_size=65536, model_version='6_0', tensor_parallel_shards=1, rescale_every=6, head_size=64, layer_norm_epsilon=1e-05, context_window_size=-1, prefill_chunk_size=4096, num_heads=32, max_batch_size=80, kwargs={})
  --quantization GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7, tensor_parallel_shards=0)
  --model-type rwkv6
  --target {"thread_warp_size": runtime.BoxInt(32), "host": {"kind": "llvm", "tag": "", "keys": ["arm_cpu", "cpu"], "mtriple": "arm64-apple-darwin"}, "max_threads_per_block": runtime.BoxInt(1024), "max_function_args": runtime.BoxInt(31), "max_num_threads": runtime.BoxInt(256), "kind": "metal", "max_shared_memory_per_block": runtime.BoxInt(32768), "tag": "", "keys": ["metal", "gpu"]}
  --opt flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output rwkv-6-world-1b6-MLC/libs/rwkv-6-world-1b6-MLC-q4f16-metal.so
  --overrides context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None;pipeline_parallel_stages=None
[2024-09-02 10:57:54] INFO compile.py:140: Creating model from: RWKV6Config(hidden_size=2048, intermediate_size=7168, num_hidden_layers=24, vocab_size=65536, model_version='6_0', tensor_parallel_shards=1, rescale_every=6, head_size=64, layer_norm_epsilon=1e-05, context_window_size=-1, prefill_chunk_size=4096, num_heads=32, max_batch_size=80, kwargs={})
[2024-09-02 10:57:54] INFO compile.py:158: Exporting the model to TVM Unity compiler
[2024-09-02 10:57:57] INFO compile.py:164: Running optimizations using TVM Unity
[2024-09-02 10:57:57] INFO compile.py:185: Registering metadata: {'model_type': 'rwkv6', 'quantization': 'q4f16_1', 'context_window_size': -1, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 4096, 'tensor_parallel_shards': 1, 'pipeline_parallel_stages': 1, 'kv_state_kind': 'rnn_state', 'max_batch_size': 80}
[2024-09-02 10:57:57] INFO pipeline.py:54: Running TVM Relax graph-level optimizations
[2024-09-02 10:57:59] INFO pipeline.py:54: Lowering to TVM TIR kernels
[2024-09-02 10:58:04] INFO pipeline.py:54: Running TVM TIR-level optimizations
[2024-09-02 10:58:22] INFO pipeline.py:54: Running TVM Dlight low-level optimizations
[2024-09-02 10:58:27] INFO pipeline.py:54: Lowering to VM bytecode
[2024-09-02 10:58:30] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 16.00 MB
[2024-09-02 10:58:30] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 106.57 MB
[2024-09-02 10:58:30] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 293.50 MB
[2024-09-02 10:58:30] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 273.75 MB
[2024-09-02 10:58:30] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_rnn_state`: 0.00 MB
[2024-09-02 10:58:30] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 1.32 MB
[2024-09-02 10:58:30] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 16.00 MB
[2024-09-02 10:58:31] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 273.75 MB
[2024-09-02 10:58:31] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-09-02 10:58:32] INFO pipeline.py:54: Compiling external modules
[2024-09-02 10:58:32] INFO pipeline.py:54: Compilation complete! Exporting to disk
Traceback (most recent call last):
  File "/Users/molly/miniconda3/envs/mlc-llm-latest/bin/mlc_llm", line 8, in
```

Expected behavior
The model library compiles successfully.
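In case it helps with triage on other hosts, the three reproduction commands can be scripted. This is only a sketch: `build_commands` and `run_all` are hypothetical helper names, the flags are exactly the ones used in this report, and the output library name for devices other than `metal` is an assumption (the device name is simply substituted into the same file name pattern).

```python
import shlex
import subprocess

MODEL = "rwkv-6-world-1b6"
OUT = f"{MODEL}-MLC"
QUANT = "q4f16_1"


def build_commands(device="metal", host="arm64-apple-darwin"):
    """Return the convert_weight, gen_config and compile commands as argv lists."""
    # Assumption: other devices reuse the same library naming pattern.
    lib = f"{OUT}/libs/{OUT}-q4f16-{device}.so"
    return [
        ["mlc_llm", "convert_weight", MODEL, "--quantization", QUANT, "-o", OUT],
        ["mlc_llm", "gen_config", MODEL, "--quantization", QUANT,
         "--conv-template", "rwkv_world", "-o", OUT],
        ["mlc_llm", "compile", f"{OUT}/mlc-chat-config.json",
         "--device", device, "--host", host, "-o", lib],
    ]


def run_all(commands):
    """Echo and run each command in order, stopping at the first failure."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)
```

Running `run_all(build_commands())` replays the steps for the Metal target; if the failure reproduces, the third command should be the one that raises `subprocess.CalledProcessError`.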