Closed: Poisonsting closed this issue 11 months ago
Congrats on finding the next-generation compilation pipeline we've been building :)) It is, however, not mature yet and subject to rapid change, which is why we haven't announced it in our documentation. This is the doc for the current-generation pipeline if you want to try out Mistral.
@CharlieFRuan and @davidpissarra are the best people to reach out to if you have follow-up questions!
@junrushao The only reason I was trying other commands is that the docs don't work. I did say "error after error" and that I'd used tools that gave AWQ errors, but I guess I'll show you instead:
python3 -m mlc_llm.build --model input/zephyr-7B-alpha-AWQ/ --quantization q4f16_awq --max-seq-len 8192 --target rocm
Results in:
usage: build.py [-h] [--model MODEL] [--hf-path HF_PATH]
[--quantization {autogptq_llama_q4f16_0,autogptq_llama_q4f16_1,q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f16_2,q4f16_ft,q4f32_0,q4f32_1,q8f16_ft,q8f16_1}]
[--max-seq-len MAX_SEQ_LEN] [--max-vocab-size MAX_VOCAB_SIZE]
[--target TARGET] [--reuse-lib REUSE_LIB]
[--artifact-path ARTIFACT_PATH] [--use-cache USE_CACHE]
[--convert-weights-only] [--build-model-only] [--debug-dump]
[--debug-load-script] [--llvm-mingw LLVM_MINGW]
[--cc-path CC_PATH] [--system-lib] [--sep-embed]
[--use-safetensors] [--enable-batching]
[--max-batch-size MAX_BATCH_SIZE] [--no-cutlass-attn]
[--no-cutlass-norm] [--no-cublas] [--use-cuda-graph]
[--num-shards NUM_SHARDS] [--use-presharded-weights]
[--use-flash-attn-mqa] [--sliding-window SLIDING_WINDOW]
[--sliding-window-chunk-size SLIDING_WINDOW_CHUNK_SIZE] [--pdb]
[--use-vllm-attention] [--convert-weight-only]
build.py: error: argument --quantization: invalid choice: 'q4f16_awq' (choose from 'autogptq_llama_q4f16_0', 'autogptq_llama_q4f16_1', 'q0f16', 'q0f32', 'q3f16_0', 'q3f16_1', 'q4f16_0', 'q4f16_1', 'q4f16_2', 'q4f16_ft', 'q4f32_0', 'q4f32_1', 'q8f16_ft', 'q8f16_1')
python build.py --model ../input/zephyr-7B-alpha-AWQ/ --quantization q4f16_awq --target rocm
Results in:
usage: build.py [-h] [--model MODEL] [--hf-path HF_PATH]
[--quantization {autogptq_llama_q4f16_0,autogptq_llama_q4f16_1,q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f16_2,q4f16_ft,q4f32_0,q4f32_1,q8f16_ft,q8f16_1}]
[--max-seq-len MAX_SEQ_LEN] [--max-vocab-size MAX_VOCAB_SIZE]
[--target TARGET] [--reuse-lib REUSE_LIB]
[--artifact-path ARTIFACT_PATH] [--use-cache USE_CACHE]
[--convert-weights-only] [--build-model-only] [--debug-dump]
[--debug-load-script] [--llvm-mingw LLVM_MINGW]
[--cc-path CC_PATH] [--system-lib] [--sep-embed]
[--use-safetensors] [--enable-batching]
[--max-batch-size MAX_BATCH_SIZE] [--no-cutlass-attn]
[--no-cutlass-norm] [--no-cublas] [--use-cuda-graph]
[--num-shards NUM_SHARDS] [--use-presharded-weights]
[--use-flash-attn-mqa] [--sliding-window SLIDING_WINDOW]
[--sliding-window-chunk-size SLIDING_WINDOW_CHUNK_SIZE] [--pdb]
[--use-vllm-attention] [--convert-weight-only]
build.py: error: argument --quantization: invalid choice: 'q4f16_awq' (choose from 'autogptq_llama_q4f16_0', 'autogptq_llama_q4f16_1', 'q0f16', 'q0f32', 'q3f16_0', 'q3f16_1', 'q4f16_0', 'q4f16_1', 'q4f16_2', 'q4f16_ft', 'q4f32_0', 'q4f32_1', 'q8f16_ft', 'q8f16_1')
Looks like you have two pipelines: one that understands Mistral, and one that understands AWQ. Neither can handle both.
AWQ is not a hard dependency for running Mistral either. You may use q4f16_1, which is the fastest option so far.
Looks like you have two pipelines: one that understands Mistral, and one that understands AWQ. Neither can handle both.
I'd love to clarify this: as documented, the official pipeline is currently mlc_llm.build, which supports Mistral and provides quantization formats like q4f16_1 (4-bit), and we highly recommend sticking with this pipeline. Any ongoing efforts that are not yet documented are not mature, and we strongly recommend against using them. We will update the documentation once they are ready to use.
That said, we may need some time to mature the new pipeline, including its support for Mistral and AWQ. Once it's mature, we will make sure the documentation is updated. Software cannot be developed overnight. Thanks for your understanding!
Meanwhile, please use the well-documented mlc_llm.build pipeline for Mistral and the 4-bit quantization format q4f16_1. Weight-only quantization formats are not all that different from each other.
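For reference, a minimal sketch of that documented flow (YOUR_MODEL_DIR is a placeholder for a model directory under dist/models; the flags mirror the commands already used in this thread):
python3 -m mlc_llm.build --model YOUR_MODEL_DIR --quantization q4f16_1 --target rocm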
Okay, trying to follow guides as closely as possible: python3 -m mlc_llm.build --hf-path TheBloke/zephyr-7B-alpha-GPTQ --max-seq-len 8192 --use-safetensors --target rocm --quantization q4f16_1
Results in even more errors:
Weights exist at dist/models/zephyr-7B-alpha-GPTQ, skipping download.
Using path "dist/models/zephyr-7B-alpha-GPTQ" for model "zephyr-7B-alpha-GPTQ"
Target configured: rocm -keys=rocm,gpu -max_num_threads=256 -max_shared_memory_per_block=65536 -max_threads_per_block=256 -mcpu=gfx1100 -mtriple=amdgcn-amd-amdhsa-hcc -thread_warp_size=64
Automatically using target for weight quantization: rocm -keys=rocm,gpu -max_num_threads=256 -max_shared_memory_per_block=65536 -max_threads_per_block=1024 -mcpu=gfx1100 -mtriple=amdgcn-amd-amdhsa-hcc -thread_warp_size=32
Get old param: 0%| | 0/197 [00:00<?, ?tensors/sStart computing and quantizing weights... This may take a while. | 0/327 [00:00<?, ?tensors/s]
Get old param: 1%|▏ | 1/197 [00:01<04:30, 1.38s/tensors]/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/relax_model/mistral.py:1015: RuntimeWarning: overflow encountered in cast
return [(torch_pname, torch_param.astype(dtype))]
Get old param: 1%|▏ | 2/197 [00:32<1:02:03, 19.10s/tensors]Traceback (most recent call last): | 1/327 [00:32<2:58:18, 32.82s/tensors]
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/build.py", line 47, in <module>
main()
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/build.py", line 43, in main
core.build_model_from_args(parsed_args)
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/core.py", line 860, in build_model_from_args
params = utils.convert_weights(mod_transform, param_manager, params, args)
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/utils.py", line 272, in convert_weights
vm["transform_params"]()
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/utils.py", line 37, in inner
return func(*args, **kwargs)
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/relax_model/param_manager.py", line 607, in get_item
[cached_torch_params[torch_pname] for torch_pname in torch_pnames],
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/relax_model/param_manager.py", line 607, in <listcomp>
[cached_torch_params[torch_pname] for torch_pname in torch_pnames],
KeyError: 'model.layers.0.self_attn.q_proj.weight'
This is driving me nuts :(
So this pipeline will work for Llama? I thought it's not possible to use GPTQ weights and you must use HF weights instead. The models I want to try are too big to get as HF. I was hoping for 120B or 70B Llama with Vulkan, to see if faster speeds on the P40 are possible, plus what kind of speed I get on Ampere as well.
The commands posted are very helpful, because I was a bit lost and looking through the source to see how to use AWQ. I will try them out and see if I get a successful compile.
edit: CUDA compile for SM61 succeeds, but Vulkan fails because it can't use float16. :(
edit2: model conversion fails because the 70B is sharded, and passing the index.json causes some error about shape. It looks like it tries to use the HF loader to manipulate it.
Hi @Poisonsting @Ph0rk0z, if I understand correctly, compiling pre-quantized weights in mlc-llm is not so mature as of now (there is an ongoing effort, SLIM, mentioned here: https://github.com/mlc-ai/mlc-llm/issues/606#issue-1823367316, that tries to support this; see related PRs: https://github.com/mlc-ai/mlc-llm/pulls?q=SLIM+).
Within that not-so-mature support, Llama has relatively the most coverage.
That being said, newly added models like Mistral haven't gone through tests with AWQ or GPTQ. Therefore, to compile Mistral, please follow the steps below:
python3 build.py --model=Mistral-7B-Instruct-v0.1 --quantization=q4f16_1 --target=YOUR_TARGET
mlc_chat_cli --model Mistral-7B-Instruct-v0.1-q4f16_1
Alternatively, use the Python API ChatModule instead of mlc_chat_cli. Note that q4f16_1 is 4-bit quantization with fp16 activation; you could also do 3-bit or fp32 by substituting the numbers.
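For instance, a sketch of that substitution (same placeholder target as above; the quantization codes come from the build.py help text earlier in this thread):
python3 build.py --model=Mistral-7B-Instruct-v0.1 --quantization=q3f16_1 --target=YOUR_TARGET   # 3-bit weights, fp16 activation
python3 build.py --model=Mistral-7B-Instruct-v0.1 --quantization=q0f32 --target=YOUR_TARGET     # no quantization, fp32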
These are the steps that are documented and hence recommended. Please stay tuned for more mature and generalized support for pre-quantized weights in MLC LLM.
@Poisonsting Ahh, for finetuned models like Zephyr the same rule applies. Please clone the original Zephyr weights (https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) rather than the pre-quantized weights. Zephyr seems to share the same model architecture as Mistral, so it should work; a rough sketch is below. Let us know. Thanks!
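Roughly, the whole flow would look like this; the dist/models path and the rocm target follow the commands earlier in this thread and are assumptions rather than a verified recipe:
git lfs install    # the original fp16 weights are stored via Git LFS
git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha dist/models/zephyr-7b-alpha
python3 -m mlc_llm.build --model zephyr-7b-alpha --quantization q4f16_1 --target rocm    # quantizes to 4-bit during the build
mlc_chat_cli --model zephyr-7b-alpha-q4f16_1    # chat with the compiled model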
Any updates?
🐛 Bug
I've been trying to figure out how to compile TheBloke/zephyr-7B-alpha-AWQ but have been running into error after error. Some tools state that AWQ isn't valid; others state that Mistral isn't.
To Reproduce
Steps to reproduce the behavior:
python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-rocm57 mlc-ai-nightly-rocm57
python3 -m mlc_chat compile --model ./input/zephyr-7B-alpha-AWQ/ --device rocm --max-sequence-length 8192 --quantization q4f16_awq -o ./output/zephyr.so
Results in:
python3 -m mlc_chat convert_weight --quantization q4f16_awq -o output/ --model input/zephyr-7B-alpha-AWQ/ --source-format awq --device rocm --model-type llama --source input/zephyr-7B-alpha-AWQ/model.safetensors
Results in:
Expected behavior
Mistral is in the list of supported model types
Environment
python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"
, applicable if you compile models):What's going on?