🐛 Bug

mlc_chat compile fails with an AssertionError from rx.analysis.well_formed(mod) when compiling Llama-2-7b-chat-hf-q4f16_1-MLC for CUDA.

To Reproduce

Steps to reproduce the behavior:

1. Install mlc-ai-nightly-cu122 0.15.dev127 and mlc-chat-nightly-cu122 0.1.dev975.
2. git clone https://huggingface.co/mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC
3. Run the compile command:

$ mlc_chat compile /data/models/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json --device cuda -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-MLC.so
[2024-03-13 03:21:56] INFO auto_config.py:69: Found model configuration: /data/models/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json
[2024-03-13 03:21:58] INFO auto_device.py:76: Found device: cuda:0
[2024-03-13 03:21:58] INFO auto_target.py:70: Found configuration of target device "cuda:0": {"thread_warp_size": 32, "arch": "sm_86", "max_threads_per_block": 1024, "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
[2024-03-13 03:21:58] INFO auto_target.py:102: Found host LLVM triple: x86_64-redhat-linux-gnu
[2024-03-13 03:21:58] INFO auto_target.py:103: Found host LLVM CPU: icelake-server
[2024-03-13 03:21:58] INFO auto_target.py:269: Generating code for CUDA architecture: sm_86
[2024-03-13 03:21:58] INFO auto_target.py:270: To produce multi-arch fatbin, set environment variable MLC_MULTI_ARCH. Example: MLC_MULTI_ARCH=70,72,75,80,86,87,89,90
[2024-03-13 03:21:58] INFO auto_config.py:153: Found model type: llama. Use --model-type to override.
Compiling with arguments:
--config LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-06, vocab_size=32000, position_embedding_base=10000, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, max_batch_size=1, kwargs={})
--quantization GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
--model-type llama
--target {"thread_warp_size": 32, "host": {"mtriple": "x86_64-redhat-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "icelake-server", "keys": ["cpu"]}, "arch": "sm_86", "max_threads_per_block": 1024, "libs": ["thrust"], "max_num_threads": 1024, "kind": "cuda", "max_shared_memory_per_block": 49152, "tag": "", "keys": ["cuda", "gpu"]}
--opt flashinfer=1;cublas_gemm=0;faster_transformer=1;cudagraph=0
--system-lib-prefix ""
--output dist/libs/Llama-2-7b-chat-hf-q4f16_1-MLC.so
--overrides context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None
[2024-03-13 03:21:58] INFO compile.py:136: Creating model from: LlamaConfig(hidden_size=4096, intermediate_size=11008, num_attention_heads=32, num_hidden_layers=32, rms_norm_eps=1e-06, vocab_size=32000, position_embedding_base=10000, context_window_size=4096, prefill_chunk_size=4096, num_key_value_heads=32, head_dim=128, tensor_parallel_shards=1, max_batch_size=1, kwargs={})
[2024-03-13 03:21:58] INFO compile.py:155: Exporting the model to TVM Unity compiler
[03:21:59] /workspace/tvm/src/relax/analysis/well_formed.cc:125: Warning: This IR is not well formed: Function 0x55d9163752a0 is annotated as pure but contains an impure call: R.call_packed("mlc.create_paged_kv_cache_generic", R.shape([max_batch_size, max_total_seq_len, prefill_chunk_size, page_size]), R.prim_value(32), R.prim_value(32), R.prim_value(32), R.prim_value(128), R.prim_value(1), R.prim_value(1), R.prim_value(10000), R.prim_value(128), R.dtype("float16"), sinfo_args=(R.Object,)). Please set relax.force_pure to true or use a pure operator variant (e.g., call_pure_packed) if it is necessary to override this judgment.
[03:21:59] /workspace/tvm/src/relax/analysis/well_formed.cc:125: Warning: This IR is not well formed: Impure function call R.call_packed("mlc.create_paged_kv_cache_generic", R.shape([max_batch_size, max_total_seq_len, prefill_chunk_size, page_size]), R.prim_value(32), R.prim_value(32), R.prim_value(32), R.prim_value(128), R.prim_value(1), R.prim_value(1), R.prim_value(10000), R.prim_value(128), R.dtype("float16"), sinfo_args=(R.Object,)) occurs within a dataflow block.
Traceback (most recent call last):
File "/usr/local/bin/mlc_chat", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/mlc_chat/main.py", line 24, in main
cli.main(sys.argv[2:])
File "/usr/local/lib/python3.10/dist-packages/mlc_chat/cli/compile.py", line 131, in main
compile(
File "/usr/local/lib/python3.10/dist-packages/mlc_chat/interface/compile.py", line 230, in compile
_compile(args, model_config)
File "/usr/local/lib/python3.10/dist-packages/mlc_chat/interface/compile.py", line 156, in _compile
mod, named_params, ext_mods = model.export_tvm(
File "/usr/local/lib/python3.10/dist-packages/tvm/relax/frontend/nn/core.py", line 489, in export_tvm
mod, params, ext_mods = Exporter(debug=debug).build(spec)
File "/usr/local/lib/python3.10/dist-packages/tvm/relax/frontend/nn/exporter.py", line 139, in build
assert rx.analysis.well_formed(mod)
AssertionError
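For context, the assertion trips on the condition the two warnings describe: an impure R.call_packed inside a dataflow block of a function annotated as pure. Below is a minimal, self-contained sketch of the pure variant the warning suggests (R.call_pure_packed). It is not MLC's generated IR; the packed-func name, arguments, and shapes are placeholders, and it assumes a recent TVM Unity nightly.

# Minimal sketch, assuming a recent TVM Unity nightly install.
# The packed-func name and arguments are placeholders; MLC's real IR calls
# "mlc.create_paged_kv_cache_generic" as shown in the log above.
import tvm
from tvm.script import ir as I, relax as R


@I.ir_module
class Sketch:
    @R.function
    def create_cache() -> R.Object:
        with R.dataflow():
            # R.call_pure_packed is the "pure operator variant" the warning
            # mentions: unlike R.call_packed, it is allowed inside a dataflow
            # block because it asserts the callee has no visible side effects.
            cache = R.call_pure_packed(
                "hypothetical.packed_func",
                R.prim_value(32),
                sinfo_args=(R.Object,),
            )
            R.output(cache)
        return cache


# exporter.py line 139 asserts on exactly this analysis result; with
# call_pure_packed (or with the "relax.force_pure" function attribute set)
# the module is reported as well formed.
assert tvm.relax.analysis.well_formed(Sketch)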
Expected behavior

Compilation completes without errors and produces dist/libs/Llama-2-7b-chat-hf-q4f16_1-MLC.so.
Environment
Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA):
Operating system (e.g. Ubuntu/Windows/MacOS/...):
Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...)
How you installed MLC-LLM (conda, source):
How you installed TVM-Unity (pip, source):
Python version (e.g. 3.10):
GPU driver version (if applicable):
CUDA/cuDNN version (if applicable):
TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
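For convenience, the TVM Unity hash-tag one-liner above expands to the following plain script (same calls, just easier to read):

# Prints TVM build metadata, including the git hash identifying the
# TVM Unity build in use. Equivalent to the one-liner in the template.
import tvm

for key, value in tvm.support.libinfo().items():
    print(f"{key}: {value}")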
Additional context