I am trying to optimise the Qwen/Qwen1.5-4B-Chat model. As I have only 8 GB of RAM on my Mac M1, I use 3-bit quantisation and a small prefill chunk size of 2048. I get the following error on running `mlc_llm chat $IR_FILES`:
[2024-07-28 00:47:14] INFO auto_config.py:70: Found model configuration: dist/shards/mlc-chat-config.json
[2024-07-28 00:47:14] INFO auto_target.py:84: Detecting target device: metal:0
[2024-07-28 00:47:14] INFO auto_target.py:86: Found target: {"thread_warp_size": 32, "max_threads_per_block": 1024, "max_function_args": 31, "max_num_threads": 256, "kind": "metal", "max_shared_memory_per_block": 32768, "tag": "", "keys": ["metal", "gpu"]}
[2024-07-28 00:47:14] INFO auto_target.py:103: Found host LLVM triple: arm64-apple-darwin22.2.0
[2024-07-28 00:47:14] INFO auto_target.py:104: Found host LLVM CPU: apple-m1
[2024-07-28 00:47:14] INFO auto_config.py:154: Found model type: qwen2. Use `--model-type` to override.
Compiling with arguments:
--config QWen2Config(hidden_act='silu', hidden_size=2560, intermediate_size=6912, num_attention_heads=20, num_hidden_layers=40, num_key_value_heads=20, rms_norm_eps=1e-06, rope_theta=5000000.0, vocab_size=151936, context_window_size=32768, prefill_chunk_size=2048, tensor_parallel_shards=1, head_dim=128, dtype='float32', max_batch_size=80, kwargs={})
--quantization GroupQuantize(name='q3f16_1', kind='group-quant', group_size=40, quantize_dtype='int3', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=10, num_storage_per_group=4, max_int_value=3)
--model-type qwen2
--target {"thread_warp_size": 32, "host": {"mtriple": "arm64-apple-darwin22.2.0", "tag": "", "kind": "llvm", "mcpu": "apple-m1", "keys": ["arm_cpu", "cpu"]}, "max_threads_per_block": 1024, "max_function_args": 31, "max_num_threads": 256, "kind": "metal", "max_shared_memory_per_block": 32768, "tag": "", "keys": ["metal", "gpu"]}
--opt flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
--system-lib-prefix ""
--output /var/folders/3d/_9ftlcj54396cwckpfssmw_h0000gn/T/tmpuh24ym5s/lib.dylib
--overrides context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-07-28 00:47:14] INFO config.py:107: Overriding tensor_parallel_shards from 1 to 1
[2024-07-28 00:47:14] INFO compile.py:127: Creating model from: QWen2Config(hidden_act='silu', hidden_size=2560, intermediate_size=6912, num_attention_heads=20, num_hidden_layers=40, num_key_value_heads=20, rms_norm_eps=1e-06, rope_theta=5000000.0, vocab_size=151936, context_window_size=32768, prefill_chunk_size=2048, tensor_parallel_shards=1, head_dim=128, dtype='float32', max_batch_size=80, kwargs={})
[2024-07-28 00:47:14] INFO compile.py:145: Exporting the model to TVM Unity compiler
[2024-07-28 00:47:17] INFO compile.py:151: Running optimizations using TVM Unity
[2024-07-28 00:47:17] INFO compile.py:171: Registering metadata: {'model_type': 'qwen2', 'quantization': 'q3f16_1', 'context_window_size': 32768, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 2048, 'tensor_parallel_shards': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 80}
[2024-07-28 00:47:18] INFO pipeline.py:52: Running TVM Relax graph-level optimizations
[2024-07-28 00:47:26] INFO pipeline.py:52: Lowering to TVM TIR kernels
[2024-07-28 00:47:33] INFO pipeline.py:52: Running TVM TIR-level optimizations
[2024-07-28 00:47:59] INFO pipeline.py:52: Running TVM Dlight low-level optimizations
[2024-07-28 00:48:01] INFO pipeline.py:52: Lowering to VM bytecode
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 10.00 MB
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 50.70 MB
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 157.76 MB
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 1298.00 MB
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-07-28 00:48:06] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 0.63 MB
[2024-07-28 00:48:06] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 10.00 MB
[2024-07-28 00:48:06] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 111.58 MB
[2024-07-28 00:48:06] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-07-28 00:48:07] INFO pipeline.py:52: Compiling external modules
[2024-07-28 00:48:07] INFO pipeline.py:52: Compilation complete! Exporting to disk
[2024-07-28 00:48:17] INFO model_metadata.py:95: Total memory usage without KV cache:: 2994.43 MB (Parameters: 1696.43 MB. Temporary buffer: 1298.00 MB)
[2024-07-28 00:48:17] INFO model_metadata.py:103: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-07-28 00:48:17] INFO compile.py:193: Generated: /var/folders/3d/_9ftlcj54396cwckpfssmw_h0000gn/T/tmpuh24ym5s/lib.dylib
[2024-07-28 00:48:17] INFO jit.py:128: Using compiled model lib: /Users/prashantdandriyal/.cache/mlc_llm/model_lib/82941e4cf5dae160d69bd8844e5ef61e.dylib
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:621: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 593, prefill chunk size will be set to 593.
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:621: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 611, prefill chunk size will be set to 611.
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:621: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 138, prefill chunk size will be set to 2048.
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:701: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 611, prefill chunk size is 611.
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:706: Estimated total single GPU memory usage: 4641.777 MB (Parameters: 1696.427 MB. KVCache: 327.001 MB. Temporary buffer: 2618.349 MB). The actual usage might be slightly larger than the estimated number.
You can use the following special commands:
/help print the special commands
/exit quit the cli
/stats print out stats of last request (token/sec)
/metrics print out full engine metrics
/reset restart a fresh chat
/set [overrides] override settings in the generation config. For example,
`/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
Note: Separate stop words in the `stop` option with commas (,).
Multi-line input: Use escape+enter to start a new line.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/Users/prashantdandriyal/miniforge3/envs/mlc/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
self.run()
File "/Users/prashantdandriyal/miniforge3/envs/mlc/lib/python3.12/threading.py", line 1010, in run
self._target(*self._args, **self._kwargs)
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/Users/prashantdandriyal/miniforge3/envs/mlc/lib/python3.12/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: TVMError: Assert fail: T.Cast("int32", fused_fused_dequantize_take1_p_lv2656_shape[1]) == 256, Argument fused_fused_dequantize_take1.p_lv2656.shape[1] has an unsatisfied constraint: 256 == T.Cast("int32", fused_fused_dequantize_take1_p_lv2656_shape[1])
Thanks for reporting. We will look into the potential issue of q3f16_1. Given it is a 4.5B model, would you mind trying the 4-bit quantization q4f16_1?
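If it helps, switching quantization only means redoing the weight conversion and config generation with `q4f16_1` and letting the chat CLI JIT-compile a new library. A minimal sketch of those commands, assuming a local checkout of the Hugging Face repo under `dist/models/Qwen1.5-4B-Chat` and fresh output directories (these paths and the `chatml` conversation template are assumptions, not verified against the original setup):

```shell
# Re-convert the weights with 4-bit group quantization instead of q3f16_1
mlc_llm convert_weight dist/models/Qwen1.5-4B-Chat \
    --quantization q4f16_1 \
    -o dist/shards-q4f16_1

# Regenerate mlc-chat-config.json for the new quantization
mlc_llm gen_config dist/models/Qwen1.5-4B-Chat \
    --quantization q4f16_1 \
    --conv-template chatml \
    -o dist/shards-q4f16_1

# Chat again; with no --model-lib given, mlc_llm JIT-compiles a Metal library for this config
mlc_llm chat dist/shards-q4f16_1
```

Roughly, 4-bit group quantization stores about 4.5 bits per weight, so the ~4B parameters come to around 2.2 GB instead of the 1.7 GB reported for q3f16_1 in the log above.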
🐛 Bug
I am trying to optimise the Qwen/Qwen1.5-4B-Chat model. As I have only 8 GB of RAM on my Mac M1, I use 3-bit quantisation and a small prefill chunk size of 2048. I get the error shown above on running `mlc_llm chat $IR_FILES`.
To Reproduce
Steps to reproduce the behavior:
1. Download the model
2. Convert the weights
3. Generate the MLC Chat Config
4. Compile the model into a library (.so)
5. Run chat

Run the final chat command `mlc_llm chat dist/shards --model-lib dist/libs/Qwen1.5-4B-Chat-3B-q3f16_1-metal.so` (the full command sequence is sketched below).
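For context, these five steps map onto the standard MLC-LLM CLI workflow. A rough sketch of the corresponding commands; the directory layout (`dist/models/...`, `dist/shards`, `dist/libs/...`) and the `chatml` conversation template are assumptions based on the final command, and the flags may differ slightly from what was actually run:

```shell
# 1. Download the model weights from Hugging Face
git lfs install
git clone https://huggingface.co/Qwen/Qwen1.5-4B-Chat dist/models/Qwen1.5-4B-Chat

# 2. Convert the weights with 3-bit group quantization (q3f16_1)
mlc_llm convert_weight dist/models/Qwen1.5-4B-Chat --quantization q3f16_1 -o dist/shards

# 3. Generate the MLC Chat Config (prefill_chunk_size=2048, as in the log above)
mlc_llm gen_config dist/models/Qwen1.5-4B-Chat --quantization q3f16_1 \
    --conv-template chatml --prefill-chunk-size 2048 -o dist/shards

# 4. Compile the model library for Metal
mlc_llm compile dist/shards/mlc-chat-config.json --device metal \
    -o dist/libs/Qwen1.5-4B-Chat-3B-q3f16_1-metal.so

# 5. Run chat against the converted weights and the compiled library
mlc_llm chat dist/shards --model-lib dist/libs/Qwen1.5-4B-Chat-3B-q3f16_1-metal.so
```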
Expected behavior
I expected the chat to start using the [redpajama conversation template](https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/conversation_template/redpajama.py).
Environment
TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models): f5f048bbd71513f087799f987019e3931f68a6d9