Closed: Poisonsting closed this issue 11 months ago
Congrats on finding the next-generation compilation pipeline we've been building :)) It is, however, not mature yet and subject to rapid change, which is why we haven't announced it in our documentation. This is the doc for the current-generation pipeline if you want to try out Mistral.
@CharlieFRuan and @davidpissarra are the best people to reach out to if you have follow-up questions!
@junrushao The only reason I was trying other commands is that the docs don't work. I did say "error after error" and that I'd used tools that gave AWQ errors, but I guess I'll show you instead:
python3 -m mlc_llm.build --model input/zephyr-7B-alpha-AWQ/ --quantization q4f16_awq --max-seq-len 8192 --target rocm
Results in:
usage: build.py [-h] [--model MODEL] [--hf-path HF_PATH]
[--quantization {autogptq_llama_q4f16_0,autogptq_llama_q4f16_1,q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f16_2,q4f16_ft,q4f32_0,q4f32_1,q8f16_ft,q8f16_1}]
[--max-seq-len MAX_SEQ_LEN] [--max-vocab-size MAX_VOCAB_SIZE]
[--target TARGET] [--reuse-lib REUSE_LIB]
[--artifact-path ARTIFACT_PATH] [--use-cache USE_CACHE]
[--convert-weights-only] [--build-model-only] [--debug-dump]
[--debug-load-script] [--llvm-mingw LLVM_MINGW]
[--cc-path CC_PATH] [--system-lib] [--sep-embed]
[--use-safetensors] [--enable-batching]
[--max-batch-size MAX_BATCH_SIZE] [--no-cutlass-attn]
[--no-cutlass-norm] [--no-cublas] [--use-cuda-graph]
[--num-shards NUM_SHARDS] [--use-presharded-weights]
[--use-flash-attn-mqa] [--sliding-window SLIDING_WINDOW]
[--sliding-window-chunk-size SLIDING_WINDOW_CHUNK_SIZE] [--pdb]
[--use-vllm-attention] [--convert-weight-only]
build.py: error: argument --quantization: invalid choice: 'q4f16_awq' (choose from 'autogptq_llama_q4f16_0', 'autogptq_llama_q4f16_1', 'q0f16', 'q0f32', 'q3f16_0', 'q3f16_1', 'q4f16_0', 'q4f16_1', 'q4f16_2', 'q4f16_ft', 'q4f32_0', 'q4f32_1', 'q8f16_ft', 'q8f16_1')
python build.py --model ../input/zephyr-7B-alpha-AWQ/ --quantization q4f16_awq --target rocm
Results in:
usage: build.py [-h] [--model MODEL] [--hf-path HF_PATH]
[--quantization {autogptq_llama_q4f16_0,autogptq_llama_q4f16_1,q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f16_2,q4f16_ft,q4f32_0,q4f32_1,q8f16_ft,q8f16_1}]
[--max-seq-len MAX_SEQ_LEN] [--max-vocab-size MAX_VOCAB_SIZE]
[--target TARGET] [--reuse-lib REUSE_LIB]
[--artifact-path ARTIFACT_PATH] [--use-cache USE_CACHE]
[--convert-weights-only] [--build-model-only] [--debug-dump]
[--debug-load-script] [--llvm-mingw LLVM_MINGW]
[--cc-path CC_PATH] [--system-lib] [--sep-embed]
[--use-safetensors] [--enable-batching]
[--max-batch-size MAX_BATCH_SIZE] [--no-cutlass-attn]
[--no-cutlass-norm] [--no-cublas] [--use-cuda-graph]
[--num-shards NUM_SHARDS] [--use-presharded-weights]
[--use-flash-attn-mqa] [--sliding-window SLIDING_WINDOW]
[--sliding-window-chunk-size SLIDING_WINDOW_CHUNK_SIZE] [--pdb]
[--use-vllm-attention] [--convert-weight-only]
build.py: error: argument --quantization: invalid choice: 'q4f16_awq' (choose from 'autogptq_llama_q4f16_0', 'autogptq_llama_q4f16_1', 'q0f16', 'q0f32', 'q3f16_0', 'q3f16_1', 'q4f16_0', 'q4f16_1', 'q4f16_2', 'q4f16_ft', 'q4f32_0', 'q4f32_1', 'q8f16_ft', 'q8f16_1')
Looks like you have two pipelines: one that understands Mistral, and one that understands AWQ. Neither can handle both.
AWQ is not a hard dependency for running Mistral either. You may use q4f16_1, which is the fastest option so far.
Looks like you have two pipelines: one that understands Mistral, and one that understands AWQ. Neither can handle both.
I'd love to clarify this: as documented, the official pipeline is currently mlc_llm.build, which supports Mistral and provides quantization formats like q4f16_1 (4-bit), and we highly recommend sticking with this pipeline. Any ongoing efforts that are not yet documented are not mature, and we strongly recommend against using them. We will update the documentation once they are ready to use.
That said, we may need some time to mature the new pipeline, including its support for Mistral and AWQ. Once it's mature, we will make sure the documentation is updated. Software cannot be developed overnight. Thanks for your understanding!
Meanwhile, please use the well-documented mlc_llm.build pipeline for Mistral and the 4-bit quantization format q4f16_1. Weight-only quantization formats are not all that different from each other.
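For reference, a minimal sketch of that documented flow (YOUR_MODEL_DIR is a placeholder for a model directory under dist/models; the flags mirror the commands already used in this thread):
python3 -m mlc_llm.build --model YOUR_MODEL_DIR --quantization q4f16_1 --target rocm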
Okay, trying to follow guides as closely as possible: python3 -m mlc_llm.build --hf-path TheBloke/zephyr-7B-alpha-GPTQ --max-seq-len 8192 --use-safetensors --target rocm --quantization q4f16_1
Results in even more errors:
Weights exist at dist/models/zephyr-7B-alpha-GPTQ, skipping download.
Using path "dist/models/zephyr-7B-alpha-GPTQ" for model "zephyr-7B-alpha-GPTQ"
Target configured: rocm -keys=rocm,gpu -max_num_threads=256 -max_shared_memory_per_block=65536 -max_threads_per_block=256 -mcpu=gfx1100 -mtriple=amdgcn-amd-amdhsa-hcc -thread_warp_size=64
Automatically using target for weight quantization: rocm -keys=rocm,gpu -max_num_threads=256 -max_shared_memory_per_block=65536 -max_threads_per_block=1024 -mcpu=gfx1100 -mtriple=amdgcn-amd-amdhsa-hcc -thread_warp_size=32
Get old param: 0%| | 0/197 [00:00<?, ?tensors/sStart computing and quantizing weights... This may take a while. | 0/327 [00:00<?, ?tensors/s]
Get old param: 1%|▏ | 1/197 [00:01<04:30, 1.38s/tensors]/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/relax_model/mistral.py:1015: RuntimeWarning: overflow encountered in cast
return [(torch_pname, torch_param.astype(dtype))]
Get old param: 1%|▏ | 2/197 [00:32<1:02:03, 19.10s/tensors]Traceback (most recent call last): | 1/327 [00:32<2:58:18, 32.82s/tensors]
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/build.py", line 47, in <module>
main()
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/build.py", line 43, in main
core.build_model_from_args(parsed_args)
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/core.py", line 860, in build_model_from_args
params = utils.convert_weights(mod_transform, param_manager, params, args)
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/utils.py", line 272, in convert_weights
vm["transform_params"]()
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/utils.py", line 37, in inner
return func(*args, **kwargs)
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/relax_model/param_manager.py", line 607, in get_item
[cached_torch_params[torch_pname] for torch_pname in torch_pnames],
File "/opt/mlc-llm/venv/lib/python3.10/site-packages/mlc_llm/relax_model/param_manager.py", line 607, in <listcomp>
[cached_torch_params[torch_pname] for torch_pname in torch_pnames],
KeyError: 'model.layers.0.self_attn.q_proj.weight'
This is driving me nuts :(
So this pipeline will work for Llama? I thought it's not possible to use GPTQ weights and you must use HF weights instead. The models I want to try are too big to get as HF. I was hoping for 120B or 70B Llama with Vulkan, to see if faster speeds on the P40 are possible, plus what kind of speed I get on Ampere as well.
The commands posted are very helpful, because I was a bit lost and looking through the source to see how to use AWQ. I will try them out and see if I get a successful compile.
edit: CUDA compile for SM61 succeeds, but Vulkan fails because it can't use float16. :(
edit2: model conversion fails because the 70B is sharded, and passing the index.json causes some error about shape. It looks like it tries to use the HF loader to manipulate it.
Hi @Poisonsting @Ph0rk0z, if I understand correctly, compiling pre-quantized weights in mlc-llm is not so mature as of now (there is an ongoing effort, SLIM, mentioned here: https://github.com/mlc-ai/mlc-llm/issues/606#issue-1823367316, that tries to support this; see related PRs: https://github.com/mlc-ai/mlc-llm/pulls?q=SLIM+).
Within that not-so-mature support, Llama has relatively the most coverage.
That being said, newly added models like Mistral haven't gone through tests with AWQ or GPTQ. Therefore, to compile Mistral, please follow the steps below:
python3 build.py --model=Mistral-7B-Instruct-v0.1 --quantization=q4f16_1 --target=YOUR_TARGET
mlc_chat_cli --model Mistral-7B-Instruct-v0.1-q4f16_1
Alternatively, use the Python API ChatModule instead of mlc_chat_cli. Note that q4f16_1 is 4-bit quantization with fp16 activation; you could also do 3-bit or fp32 by substituting the numbers.
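For instance, a sketch of that substitution (same placeholder target as above; the quantization codes come from the build.py help text earlier in this thread):
python3 build.py --model=Mistral-7B-Instruct-v0.1 --quantization=q3f16_1 --target=YOUR_TARGET   # 3-bit weights, fp16 activation
python3 build.py --model=Mistral-7B-Instruct-v0.1 --quantization=q0f32 --target=YOUR_TARGET     # no quantization, fp32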
These are the steps that are documented and hence recommended. Please stay tuned for more mature and generalized support for pre-quantized weights in MLC LLM.
@Poisonsting Ahh, for finetuned models like Zephyr the same rule applies. Please clone the original Zephyr weights (https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) rather than the pre-quantized weights. Zephyr seems to share the same model architecture as Mistral, so it should work; a rough sketch is below. Let us know. Thanks!
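Roughly, the whole flow would look like this; the dist/models path and the rocm target follow the commands earlier in this thread and are assumptions rather than a verified recipe:
git lfs install    # the original fp16 weights are stored via Git LFS
git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha dist/models/zephyr-7b-alpha
python3 -m mlc_llm.build --model zephyr-7b-alpha --quantization q4f16_1 --target rocm    # quantizes to 4-bit during the build
mlc_chat_cli --model zephyr-7b-alpha-q4f16_1    # chat with the compiled model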
Any updates?
🐛 Bug
I've been trying to figure out how to compile TheBloke/zephyr-7B-alpha-AWQ but have been running into error after error. Some tools state that AWQ isn't valid; others state that Mistral isn't.
To Reproduce
Steps to reproduce the behavior:
python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-rocm57 mlc-ai-nightly-rocm57
python3 -m mlc_chat compile --model ./input/zephyr-7B-alpha-AWQ/ --device rocm --max-sequence-length 8192 --quantization q4f16_awq -o ./output/zephyr.so
Results in:
python3 -m mlc_chat convert_weight --quantization q4f16_awq -o output/ --model input/zephyr-7B-alpha-AWQ/ --source-format awq --device rocm --model-type llama --source input/zephyr-7B-alpha-AWQ/model.safetensors
Results in:
Expected behavior
Mistral is in the list of supported model types
Environment
python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"
, applicable if you compile models):What's going on?