turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

EXL2 quantization for the Qwen model #381

Closed: zchen-cpu closed this issue 5 months ago

zchen-cpu commented 6 months ago

The EXL2 quantization technique has proven both effective and efficient when applied to the Llama model, and I am keen to apply it to the Qwen model as well. However, in doing so I have encountered the following errors:

Command:

CUDA_VISIBLE_DEVICES=0,1 python convert.py -i /workspace/data4/models/Qwen-72B-Chat/  -o /workspace/home/zc/AIGC_quantization/quantized_test/ -nr -c /workspace/home/zc/alpaca_data_cleaned.parquet -b 3.3

Error info:

!! Warning, unknown architecture: QWenLMHeadModel
 !! Loading as LlamaForCausalLM
Traceback (most recent call last):
  File "convert.py", line 65, in <module>
    config.prepare()
  File "/workspace/home/zc/exllamav2/exllamav2/config.py", line 103, in prepare
    self.norm_eps = read_config[self.arch.norm_eps_key]
KeyError: 'rms_norm_eps'
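
As the warning suggests, convert.py reads the checkpoint's config.json, does not recognize the QWenLMHeadModel architecture, and falls back to Llama config keys; the original Qwen config names its norm epsilon differently (layer_norm_epsilon rather than rms_norm_eps), hence the KeyError. A minimal sketch (not exllamav2 code) for checking what the converter will see, assuming the standard Hugging Face checkpoint layout:

import json, os

model_dir = "/workspace/data4/models/Qwen-72B-Chat/"  # path from the command above

# The converter detects the architecture from config.json before quantizing.
with open(os.path.join(model_dir, "config.json")) as f:
    cfg = json.load(f)

print(cfg.get("architectures"))  # ['QWenLMHeadModel'] -> unsupported, Llama fallback
print(cfg.get("rms_norm_eps"))   # None -> the KeyError raised in config.prepare()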

turboderp commented 6 months ago

I don't know about supporting Qwen. Qwen2 is already supported, and I think those are just better models anyway? There's a bunch of EXL2 conversions on Hugging Face already.
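
For reference, pulling one of those ready-made EXL2 conversions from the Hub is a few lines with huggingface_hub; the repo id and branch below are hypothetical placeholders (EXL2 repos on the Hub commonly publish one branch per bitrate):

from huggingface_hub import snapshot_download

# Download a prequantized EXL2 checkpoint instead of converting locally.
local_dir = snapshot_download(
    repo_id="someuser/Qwen1.5-14B-Chat-exl2",  # hypothetical repo id; search the Hub for "exl2"
    revision="3.3bpw",                         # hypothetical branch name for the target bitrate
)
print(local_dir)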

zchen-cpu commented 6 months ago

I switched to a Qwen2 model, but encountered the following error:

 -- Beginning new job
 !! Warning: Output directory is not empty: /mnt/2T/zc/test/
 !! Cleaning output directory: /mnt/2T/zc/test/
 -- Input: /mnt/afs/data/model/open_source_data/Qwen/Qwen1.5-14B/
 -- Output: /mnt/2T/zc/test/
 -- Using default calibration dataset
 -- Target bits per weight: 3.3 (decoder), 6 (head)
 -- Max shard size: 8192 MB
 -- Tokenizing samples (measurement)...
 -- Token embeddings (measurement)...
 -- Measuring quantization impact...
 -- Layer: model.layers.0 (Attention)
Traceback (most recent call last):
  File "convert.py", line 223, in <module>
    status = measure_quant(job, save_job, model)  # capturing the graceful exits
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/2T/zc/SRI_exllama/conversion/measure.py", line 416, in measure_quant
    quantizers["q_proj"] = AdaptiveGPTQ(module.q_proj.linear)
  File "/mnt/2T/zc/SRI_exllama/conversion/adaptivegptq.py", line 112, in __init__
    self.device = layer.weight.device
AttributeError: 'NoneType' object has no attribute 'weight'

turboderp commented 6 months ago

Yes, that's a separate issue, though. It's a bug that should be fixed with the latest commit. There's going to be a new release very soon, but in the meantime you can update to the latest dev release and it should sort itself out.

If you can't build the extension, you can roll back to c60ac6e and use a prebuilt wheel from 0.0.15, which would also allow you to quantize Qwen2 models.
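
For anyone hitting the same wall, the rollback amounts to something like this (hedged: prebuilt 0.0.15 wheels are attached to the GitHub releases page, and the pip command only works if that version is published for your platform):

git checkout c60ac6e            # pin the source tree to the commit mentioned above
pip install exllamav2==0.0.15   # or install the matching prebuilt wheel from the releases page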

zchen-cpu commented 6 months ago

When will the new version be released? Looking forward to the resolution of this issue.

turboderp commented 6 months ago

I'll release 0.0.17 in the coming days with the fix. But in the meantime you can still convert Qwen by either compiling the current version from source or downgrading to 0.0.15.

jay-c88 commented 6 months ago

CUDA_VISIBLE_DEVICES=0,1 python convert.py -i /workspace/data4/models/Qwen-72B-Chat/  -o /workspace/home/zc/AIGC_quantization/quantized_test/ -nr -c /workspace/home/zc/alpaca_data_cleaned.parquet -b 3.3

Just a quick question about CUDA_VISIBLE_DEVICES=0,1: does this actually work? From the other issue #349 I assumed I couldn't use two GPUs to convert.

turboderp commented 6 months ago

You can set which devices are active that way, but the converter will currently only use the first one. So you can start multiple jobs at once with a different visible GPU for each, but multi-GPU conversion isn't supported yet.
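
In other words, a workaround is one conversion job per GPU, each with a single visible device; a sketch reusing the convert.py flags from above with hypothetical model paths:

CUDA_VISIBLE_DEVICES=0 python convert.py -i /models/modelA/ -o /out/modelA/ -nr -b 3.3 &
CUDA_VISIBLE_DEVICES=1 python convert.py -i /models/modelB/ -o /out/modelB/ -nr -b 3.3 &
wait   # both jobs run concurrently, one GPU each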