mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] MacBook Pro M4 Max (Apple Silicon): mlc_llm compile of Qwen2.5 q4f32 MLC .so fails #3036

Closed. l241025097 closed this issue 11 hours ago.

l241025097 commented 2 days ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

mlc_llm compile /path/to/models/qwen/Qwen2.5-32B-Instruct-q4f32_1-MLC/mlc-chat-config.json \
  -o /path/to/models/qwen/Qwen2.5-32B-Instruct-q4f32_1-MLC/libs/Qwen2.5-32B-Instruct-q4f32_1-MLC-metal.so
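For context, the model directory being compiled would typically have been produced beforehand with convert_weight and gen_config. A minimal sketch of those steps with illustrative paths; the --conv-template value is an assumption, not taken from this report:

# Quantize the original weights to q4f32_1 (paths illustrative)
mlc_llm convert_weight /path/to/Qwen2.5-32B-Instruct \
  --quantization q4f32_1 \
  -o /path/to/models/qwen/Qwen2.5-32B-Instruct-q4f32_1-MLC

# Generate mlc-chat-config.json; the template name below is an assumption,
# check `mlc_llm gen_config --help` for the supported names
mlc_llm gen_config /path/to/Qwen2.5-32B-Instruct \
  --quantization q4f32_1 --conv-template qwen2 \
  -o /path/to/models/qwen/Qwen2.5-32B-Instruct-q4f32_1-MLC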

Expected behavior

Environment

Additional context

[11:34:19] /Users/lyn/Documents/python/learn/mlc_llm/modules/tvm-unity/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.4 with -mcpu=apple-latest is not valid in -mtriple=arm64-apple-macos, using default -mcpu=generic
[11:34:19] /Users/lyn/Documents/python/learn/mlc_llm/modules/tvm-unity/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.4 with -mcpu=apple-latest is not valid in -mtriple=arm64-apple-macos, using default -mcpu=generic
[11:34:19] /Users/lyn/Documents/python/learn/mlc_llm/modules/tvm-unity/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.4 with -mcpu=apple-latest is not valid in -mtriple=arm64-apple-macos, using default -mcpu=generic
CLML Target Version: 3
[2024-11-20 11:34:19] INFO auto_config.py:70: Found model configuration: /Users/lyn/Documents/python/learn/mlc_llm/models/qwen/Qwen2.5-32B-Instruct-q4f32_1-MLC/mlc-chat-config.json
[2024-11-20 11:34:20] INFO auto_device.py:88: Not found device: cuda:0
[2024-11-20 11:34:20] INFO auto_device.py:88: Not found device: rocm:0
[2024-11-20 11:34:21] INFO auto_device.py:79: Found device: metal:0
[2024-11-20 11:34:21] INFO auto_device.py:88: Not found device: vulkan:0
[2024-11-20 11:34:22] INFO auto_device.py:88: Not found device: opencl:0
[2024-11-20 11:34:22] INFO auto_device.py:35: Using device: metal:0
[2024-11-20 11:34:22] INFO auto_target.py:78: Found configuration of target device "metal:0": {"thread_warp_size": runtime.BoxInt(32), "max_threads_per_block": runtime.BoxInt(1024), "max_function_args": runtime.BoxInt(31), "max_num_threads": runtime.BoxInt(256), "kind": "metal", "max_shared_memory_per_block": runtime.BoxInt(32768), "tag": "", "keys": ["metal", "gpu"]}
[11:34:22] /Users/lyn/Documents/python/learn/mlc_llm/modules/tvm-unity/src/target/llvm/llvm_instance.cc:226: Error: Using LLVM 19.1.4 with -mcpu=apple-m3 is not valid in -mtriple=arm64-apple-darwin24.1.0, using default -mcpu=generic
[2024-11-20 11:34:22] INFO auto_target.py:110: Found host LLVM triple: arm64-apple-darwin24.1.0
[2024-11-20 11:34:22] INFO auto_target.py:111: Found host LLVM CPU: apple-m3
[2024-11-20 11:34:22] INFO auto_config.py:154: Found model type: qwen2. Use --model-type to override.
Compiling with arguments:
  --config          QWen2Config(hidden_act='silu', hidden_size=5120, intermediate_size=27648, num_attention_heads=40, num_hidden_layers=64, num_key_value_heads=8, rms_norm_eps=1e-06, rope_theta=1000000.0, vocab_size=152064, tie_word_embeddings=False, context_window_size=32768, prefill_chunk_size=2048, tensor_parallel_shards=1, head_dim=128, dtype='float32', max_batch_size=80, kwargs={})
  --quantization    GroupQuantize(name='q4f32_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float32', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7, tensor_parallel_shards=0)
  --model-type      qwen2
  --target          {"thread_warp_size": runtime.BoxInt(32), "host": {"mtriple": "arm64-apple-darwin24.1.0", "tag": "", "kind": "llvm", "mcpu": "apple-m3", "keys": ["arm_cpu", "cpu"]}, "max_threads_per_block": runtime.BoxInt(1024), "max_function_args": runtime.BoxInt(31), "max_num_threads": runtime.BoxInt(256), "kind": "metal", "max_shared_memory_per_block": runtime.BoxInt(32768), "tag": "", "keys": ["metal", "gpu"]}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          /Users/lyn/Documents/python/learn/mlc_llm/models/qwen/Qwen2.5-32B-Instruct-q4f32_1-MLC/libs/Qwen2.5-32B-Instruct-q4f32_1-MLC-metal.so
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None;pipeline_parallel_stages=None
[2024-11-20 11:34:22] INFO compile.py:140: Creating model from: QWen2Config(hidden_act='silu', hidden_size=5120, intermediate_size=27648, num_attention_heads=40, num_hidden_layers=64, num_key_value_heads=8, rms_norm_eps=1e-06, rope_theta=1000000.0, vocab_size=152064, tie_word_embeddings=False, context_window_size=32768, prefill_chunk_size=2048, tensor_parallel_shards=1, head_dim=128, dtype='float32', max_batch_size=80, kwargs={})
[2024-11-20 11:34:22] INFO compile.py:158: Exporting the model to TVM Unity compiler
[2024-11-20 11:34:24] INFO compile.py:164: Running optimizations using TVM Unity
[2024-11-20 11:34:24] INFO compile.py:185: Registering metadata: {'model_type': 'qwen2', 'quantization': 'q4f32_1', 'context_window_size': 32768, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 2048, 'tensor_parallel_shards': 1, 'pipeline_parallel_stages': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 80}
[2024-11-20 11:34:25] INFO pipeline.py:54: Running TVM Relax graph-level optimizations
[2024-11-20 11:34:27] INFO pipeline.py:54: Lowering to TVM TIR kernels
[2024-11-20 11:34:31] INFO pipeline.py:54: Running TVM TIR-level optimizations
[2024-11-20 11:34:39] INFO pipeline.py:54: Running TVM Dlight low-level optimizations
[2024-11-20 11:34:44] INFO pipeline.py:54: Lowering to VM bytecode
[2024-11-20 11:34:45] INFO estimate_memory_usage.py:58: [Memory usage] Function alloc_embedding_tensor: 40.00 MB
[2024-11-20 11:34:45] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_decode: 80.16 MB
[2024-11-20 11:34:45] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_prefill: 911.97 MB
[2024-11-20 11:34:45] INFO estimate_memory_usage.py:58: [Memory usage] Function batch_verify: 2052.00 MB
[2024-11-20 11:34:45] INFO estimate_memory_usage.py:58: [Memory usage] Function create_tir_paged_kv_cache: 0.00 MB
[2024-11-20 11:34:45] INFO estimate_memory_usage.py:58: [Memory usage] Function decode: 1.00 MB
[2024-11-20 11:34:45] INFO estimate_memory_usage.py:58: [Memory usage] Function embed: 40.00 MB
[2024-11-20 11:34:45] INFO estimate_memory_usage.py:58: [Memory usage] Function prefill: 864.60 MB
[2024-11-20 11:34:45] INFO estimate_memory_usage.py:58: [Memory usage] Function softmax_with_temperature: 0.00 MB
[2024-11-20 11:34:46] INFO pipeline.py:54: Compiling external modules
[2024-11-20 11:34:46] INFO pipeline.py:54: Compilation complete! Exporting to disk
Traceback (most recent call last):
  File "/Users/lyn/Applications/miniforge3/envs/mlc_llm_dev/bin/mlc_llm", line 33, in <module>
    sys.exit(load_entry_point('mlc-llm', 'console_scripts', 'mlc_llm')())
  File "/Users/lyn/Documents/python/learn/mlc_llm/modules/mlc-llm/python/mlc_llm/main.py", line 33, in main
    cli.main(sys.argv[2:])
  File "/Users/lyn/Documents/python/learn/mlc_llm/modules/mlc-llm/python/mlc_llm/cli/compile.py", line 129, in main
    compile(
  File "/Users/lyn/Documents/python/learn/mlc_llm/modules/mlc-llm/python/mlc_llm/interface/compile.py", line 243, in compile
    _compile(args, model_config)
  File "/Users/lyn/Documents/python/learn/mlc_llm/modules/mlc-llm/python/mlc_llm/interface/compile.py", line 188, in _compile
    args.build_func(
  File "/Users/lyn/Documents/python/learn/mlc_llm/modules/mlc-llm/python/mlc_llm/support/auto_target.py", line 301, in build
    relax.build(
  File "/Users/lyn/Documents/python/learn/mlc_llm/modules/tvm-unity/python/tvm/relax/vm_build.py", line 353, in build
    return _vmlink(
  File "/Users/lyn/Documents/python/learn/mlc_llm/modules/tvm-unity/python/tvm/relax/vm_build.py", line 249, in _vmlink
    lib = tvm.build(
  File "/Users/lyn/Documents/python/learn/mlc_llm/modules/tvm-unity/python/tvm/driver/build_module.py", line 297, in build
    rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
  File "/Users/lyn/Documents/python/learn/mlc_llm/modules/tvm-unity/python/tvm/_ffi/_ctypes/packed_func.py", line 245, in __call__
    raise_last_ffi_error()
  File "/Users/lyn/Documents/python/learn/mlc_llm/modules/tvm-unity/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  File "/Users/lyn/Documents/python/learn/mlc_llm/modules/tvm-unity/src/tir/transforms/storage_rewrite.cc", line 1494
InternalError: Check failed: (me->coeff == 0 || info.factor() % me->coeff == 0) is false:
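As a quick sanity check of the device/host auto-detection shown in the log above, the Metal device can be queried directly through the bundled TVM runtime. A minimal sketch, assuming the tvm-unity Python package is importable; it should print True 1024 32, matching the detected target configuration:

python -c "import tvm; dev = tvm.metal(0); print(dev.exist, dev.max_threads_per_block, dev.warp_size)"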

MasterJH5574 commented 1 day ago

Thank you @l241025097 for reporting. We'll take a look.

MasterJH5574 commented 22 hours ago

Hi @l241025097, could you please upgrade the nightly package to the latest version and try again? We have fixed this, so the issue should be gone.
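For reference, upgrading the prebuilt nightly wheels usually looks like the following. This is a sketch based on the install docs at https://llm.mlc.ai/docs/install/mlc_llm.html; the exact package names are an assumption here:

# Upgrade to the latest nightly wheels from the MLC wheel index
# (package names assumed; on Apple Silicon the cpu wheels include Metal support)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu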

l241025097 commented 11 hours ago

Thank you very much. Building and installing the latest nightly version from source did indeed resolve the issue.
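For anyone landing here later, the from-source route roughly follows the documented build flow. A sketch with an illustrative job count; see https://llm.mlc.ai/docs/install/mlc_llm.html for the authoritative steps:

git clone --recursive https://github.com/mlc-ai/mlc-llm.git && cd mlc-llm
mkdir -p build && cd build
# interactively generates config.cmake; enable Metal when prompted on macOS
python ../cmake/gen_cmake_config.py
cmake .. && cmake --build . --parallel 8 && cd ..
# install the Python package from the checkout
cd python && pip install -e .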