mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] tvm._ffi.base.TVMError: TVMError: Assert fail: T.Cast("int32", fused_fused_dequantize_take1_p_lv2656_shape[1]) == 256 #2700

Closed: pra-dan closed this issue 2 months ago

pra-dan commented 3 months ago

🐛 Bug

I am trying to optimise the Qwen/Qwen1.5-4B-Chat model. Since I only have 8 GB of RAM on my Mac M1, I use 3-bit quantisation and a small prefill chunk size of 2048. I get the following error when running `mlc_llm chat $IR_FILES`:

[2024-07-28 00:47:14] INFO auto_config.py:70: Found model configuration: dist/shards/mlc-chat-config.json
[2024-07-28 00:47:14] INFO auto_target.py:84: Detecting target device: metal:0
[2024-07-28 00:47:14] INFO auto_target.py:86: Found target: {"thread_warp_size": 32, "max_threads_per_block": 1024, "max_function_args": 31, "max_num_threads": 256, "kind": "metal", "max_shared_memory_per_block": 32768, "tag": "", "keys": ["metal", "gpu"]}
[2024-07-28 00:47:14] INFO auto_target.py:103: Found host LLVM triple: arm64-apple-darwin22.2.0
[2024-07-28 00:47:14] INFO auto_target.py:104: Found host LLVM CPU: apple-m1
[2024-07-28 00:47:14] INFO auto_config.py:154: Found model type: qwen2. Use `--model-type` to override.
Compiling with arguments:
  --config          QWen2Config(hidden_act='silu', hidden_size=2560, intermediate_size=6912, num_attention_heads=20, num_hidden_layers=40, num_key_value_heads=20, rms_norm_eps=1e-06, rope_theta=5000000.0, vocab_size=151936, context_window_size=32768, prefill_chunk_size=2048, tensor_parallel_shards=1, head_dim=128, dtype='float32', max_batch_size=80, kwargs={})
  --quantization    GroupQuantize(name='q3f16_1', kind='group-quant', group_size=40, quantize_dtype='int3', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=10, num_storage_per_group=4, max_int_value=3)
  --model-type      qwen2
  --target          {"thread_warp_size": 32, "host": {"mtriple": "arm64-apple-darwin22.2.0", "tag": "", "kind": "llvm", "mcpu": "apple-m1", "keys": ["arm_cpu", "cpu"]}, "max_threads_per_block": 1024, "max_function_args": 31, "max_num_threads": 256, "kind": "metal", "max_shared_memory_per_block": 32768, "tag": "", "keys": ["metal", "gpu"]}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          /var/folders/3d/_9ftlcj54396cwckpfssmw_h0000gn/T/tmpuh24ym5s/lib.dylib
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-07-28 00:47:14] INFO config.py:107: Overriding tensor_parallel_shards from 1 to 1
[2024-07-28 00:47:14] INFO compile.py:127: Creating model from: QWen2Config(hidden_act='silu', hidden_size=2560, intermediate_size=6912, num_attention_heads=20, num_hidden_layers=40, num_key_value_heads=20, rms_norm_eps=1e-06, rope_theta=5000000.0, vocab_size=151936, context_window_size=32768, prefill_chunk_size=2048, tensor_parallel_shards=1, head_dim=128, dtype='float32', max_batch_size=80, kwargs={})
[2024-07-28 00:47:14] INFO compile.py:145: Exporting the model to TVM Unity compiler
[2024-07-28 00:47:17] INFO compile.py:151: Running optimizations using TVM Unity
[2024-07-28 00:47:17] INFO compile.py:171: Registering metadata: {'model_type': 'qwen2', 'quantization': 'q3f16_1', 'context_window_size': 32768, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 2048, 'tensor_parallel_shards': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 80}
[2024-07-28 00:47:18] INFO pipeline.py:52: Running TVM Relax graph-level optimizations
[2024-07-28 00:47:26] INFO pipeline.py:52: Lowering to TVM TIR kernels
[2024-07-28 00:47:33] INFO pipeline.py:52: Running TVM TIR-level optimizations
[2024-07-28 00:47:59] INFO pipeline.py:52: Running TVM Dlight low-level optimizations
[2024-07-28 00:48:01] INFO pipeline.py:52: Lowering to VM bytecode
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 10.00 MB
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 50.70 MB
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 157.76 MB
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 1298.00 MB
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-07-28 00:48:06] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 0.63 MB
[2024-07-28 00:48:06] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 10.00 MB
[2024-07-28 00:48:06] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 111.58 MB
[2024-07-28 00:48:06] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-07-28 00:48:07] INFO pipeline.py:52: Compiling external modules
[2024-07-28 00:48:07] INFO pipeline.py:52: Compilation complete! Exporting to disk
[2024-07-28 00:48:17] INFO model_metadata.py:95: Total memory usage without KV cache:: 2994.43 MB (Parameters: 1696.43 MB. Temporary buffer: 1298.00 MB)
[2024-07-28 00:48:17] INFO model_metadata.py:103: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-07-28 00:48:17] INFO compile.py:193: Generated: /var/folders/3d/_9ftlcj54396cwckpfssmw_h0000gn/T/tmpuh24ym5s/lib.dylib
[2024-07-28 00:48:17] INFO jit.py:128: Using compiled model lib: /Users/prashantdandriyal/.cache/mlc_llm/model_lib/82941e4cf5dae160d69bd8844e5ef61e.dylib
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:621: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 593, prefill chunk size will be set to 593. 
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:621: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 611, prefill chunk size will be set to 611. 
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:621: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 138, prefill chunk size will be set to 2048. 
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:701: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 611, prefill chunk size is 611.
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:706: Estimated total single GPU memory usage: 4641.777 MB (Parameters: 1696.427 MB. KVCache: 327.001 MB. Temporary buffer: 2618.349 MB). The actual usage might be slightly larger than the estimated number.
You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out stats of last request (token/sec)
  /metrics            print out full engine metrics
  /reset              restart a fresh chat
  /set [overrides]    override settings in the generation config. For example,
                      `/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
                      Note: Separate stop words in the `stop` option with commas (,).
  Multi-line input: Use escape+enter to start a new line.

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/prashantdandriyal/miniforge3/envs/mlc/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/Users/prashantdandriyal/miniforge3/envs/mlc/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/Users/prashantdandriyal/miniforge3/envs/mlc/lib/python3.12/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: TVMError: Assert fail: T.Cast("int32", fused_fused_dequantize_take1_p_lv2656_shape[1]) == 256, Argument fused_fused_dequantize_take1.p_lv2656.shape[1] has an unsatisfied constraint: 256 == T.Cast("int32", fused_fused_dequantize_take1_p_lv2656_shape[1])

To Reproduce

Steps to reproduce the behavior:

  1. Download model

    huggingface-cli download --local-dir dist Qwen/Qwen1.5-4B-Chat
  2. Convert weights

    # convert weights
    MODEL_PATH=dist
    QUANTIZATION=q4f16_1
    MODEL_NAME=Qwen1.5-4B-Chat
    IR_FILES=dist/shards
    mlc_llm convert_weight $MODEL_PATH/ --quantization $QUANTIZATION -o $IR_FILES

  3. Generate MLC Chat Config

    MODEL_PATH=dist
    QUANTIZATION=q3f16_1
    IR_FILES=dist/shards
    mlc_llm gen_config $MODEL_PATH \
        --prefill-chunk-size 2048 \
        --quantization $QUANTIZATION --conv-template redpajama_chat \
        -o $IR_FILES

  4. Compile the model into a library (.so)

    # Create an output directory for the compiled model library
    mkdir dist/libs

    # compile
    MLC_CHAT_CONFIG=dist/shards/mlc-chat-config.json
    QUANTIZATION=q3f16_1
    mlc_llm compile $MLC_CHAT_CONFIG \
        --device metal -o dist/libs/Qwen1.5-4B-Chat-3B-$QUANTIZATION-metal.so

  5. Run the final chat command

    mlc_llm chat dist/shards --model-lib dist/libs/Qwen1.5-4B-Chat-3B-q3f16_1-metal.so

Expected behavior

I expected the chat to start using the [redpajama conversation template](https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/conversation_template/redpajama.py).

Environment

MasterJH5574 commented 3 months ago

Thanks for reporting. We will look into the potential issue with q3f16_1. Given that it is a 4.5B model, would you mind trying the 4-bit quantization q4f16_1?
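
For reference, a rough sketch of the same steps with q4f16_1 used consistently across weight conversion, config generation, and compilation (paths and flags are reused from the report above; the output library name is just an illustrative choice):

    # Sketch only: same commands as in the report, with a single quantization value throughout.
    MODEL_PATH=dist
    QUANTIZATION=q4f16_1
    IR_FILES=dist/shards

    # convert weights
    mlc_llm convert_weight $MODEL_PATH/ --quantization $QUANTIZATION -o $IR_FILES

    # generate the chat config
    mlc_llm gen_config $MODEL_PATH \
        --prefill-chunk-size 2048 \
        --quantization $QUANTIZATION --conv-template redpajama_chat \
        -o $IR_FILES

    # compile the model library for Metal
    mlc_llm compile $IR_FILES/mlc-chat-config.json \
        --device metal -o dist/libs/Qwen1.5-4B-Chat-$QUANTIZATION-metal.so

    # chat against the converted weights and the compiled library
    mlc_llm chat $IR_FILES --model-lib dist/libs/Qwen1.5-4B-Chat-$QUANTIZATION-metal.so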