Closed: wangxu569 closed this issue 1 year ago.
We have refactored the parameter managers, and q8f16 is not supported in the new quantization framework yet (https://github.com/mlc-ai/mlc-llm/blob/d800c783337dc10870da3a3fe0b0517d50ba3ab5/mlc_llm/quantization/__init__.py#L84). cc @MasterJH5574
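For context, the new framework resolves the `--quantization` flag by name against a registry in `mlc_llm/quantization/__init__.py` (linked above). The sketch below is a simplified, hypothetical illustration of that lookup, not the actual mlc-llm code; the class and field names are made up. It only shows why a name that is not registered, such as `q8f16`, fails at build time.

```python
# Hypothetical sketch of a name-keyed quantization registry.
# Identifiers are illustrative only; see mlc_llm/quantization/__init__.py
# in the linked commit for the real definitions.
from dataclasses import dataclass
from typing import Dict


@dataclass
class QuantizationScheme:
    """Describes how weights are stored: bit width and compute dtype."""
    name: str
    storage_nbit: int
    dtype: str


# Only names registered here can be passed to --quantization.
quantization_schemes: Dict[str, QuantizationScheme] = {
    "q3f16_0": QuantizationScheme("q3f16_0", storage_nbit=3, dtype="float16"),
    "q4f16_0": QuantizationScheme("q4f16_0", storage_nbit=4, dtype="float16"),
    # "q8f16_0" was not registered at the time of this issue, so the build
    # script could not resolve the scheme and raised an error.
}


def get_scheme(name: str) -> QuantizationScheme:
    """Look up a quantization scheme by name, failing for unknown modes."""
    if name not in quantization_schemes:
        raise ValueError(f"Unsupported quantization mode: {name}")
    return quantization_schemes[name]
```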
Okay, I see it
Hi @wangxu569, after some recent refactoring you are now able to use q8f16_0 for Vicuna (https://github.com/mlc-ai/mlc-llm/blob/f121844287a4ba232e8c76e52e8b30aa24f8e08a/mlc_llm/quantization/__init__.py#L85-L89).
Nevertheless, we don't recommend it: right now q8f16_0 is designed for RWKV, and applying it to Vicuna may give suboptimal performance. For now we would recommend using q3f16_0 for Vicuna and other LLaMA-family models. Likely next week we will enable a new q4f16_1 quantization mode that has even better performance.
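For reference, assuming the same build.py invocation from the bug report below, switching to the recommended mode only means changing the `--quantization` flag:

```
python3 build.py --model vicuna-7b-delta-v1.1 --target cuda --quantization q3f16_0
```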
## 🐛 Bug
When I use 8-bit quantization, this error occurs; other quantization modes do not have this problem.

## To Reproduce

Steps to reproduce the behavior:

1. Download the vicuna-7b-delta-v1.1 model files from https://huggingface.co/lmsys/vicuna-7b-delta-v1.1 and save them to dist/models/vicuna-7b-delta-v1.1
2. Run `python3 build.py --model vicuna-7b-delta-v1.1 --target cuda --quantization q8f16_0`

Output:

```
Using path "dist/models/vicuna-7b-delta-v1.1" for model "vicuna-7b-delta-v1.1"
Database paths: ['log_db/redpajama-3b-q4f16', 'log_db/rwkv-raven-3b', 'log_db/dolly-v2-3b', 'log_db/redpajama-3b-q4f32', 'log_db/rwkv-raven-1b5', 'log_db/vicuna-v1-7b', 'log_db/rwkv-raven-7b']
Target configured: cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -thread_warp_size=32
Traceback (most recent call last):
  File "/home/xs11/wangxu/mlc-llm/build.py", line 457, in
```

## Environment
## Additional context