vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Apache License 2.0

W8A8 quantization failed with bloomz-7b1 #905

Closed: moonlightian closed this issue 1 week ago

moonlightian commented 2 weeks ago

It seems that BLOOM is not supported for quantization right now.
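For reference, my script is essentially a one-shot W8A8 run with the default SmoothQuant recipe. A minimal sketch of w8a8v2.py, reconstructed from the traceback below (the dataset, sequence length, and sample count are placeholders):

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# SmoothQuantModifier's default mappings target LLaMA-style module names
# (re:.*q_proj etc.), which BloomForCausalLM does not contain
recipe = [
    SmoothQuantModifier(smoothing_strength=0.5),
    GPTQModifier(scheme="W8A8", targets=["Linear"], ignore=["lm_head"]),
]

oneshot(
    model="bigscience/bloomz-7b1",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="outdir",
    max_seq_length=512,
    num_calibration_samples=64,
)

Running this fails while resolving the SmoothQuant mappings: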

Traceback (most recent call last):
  File "/home/work/vllm-main/scripts/w8a8v2.py", line 40, in <module>
    oneshot(
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 76, in oneshot
    main(model_args, data_args, training_args)
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 364, in main
    stage_runner.one_shot()
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/transformers/finetune/runner.py", line 171, in one_shot
    self.trainer.one_shot(calibration_data=calib_data, stage=stage)
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/transformers/finetune/session_mixin.py", line 401, in one_shot
    apply(
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/core/session_functions.py", line 184, in apply
    return active_session().apply(
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/core/session.py", line 210, in apply
    self.initialize(**kwargs)
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/core/session.py", line 156, in initialize
    mod_data = self._lifecycle.initialize(
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/core/lifecycle.py", line 126, in initialize
    data = mod.initialize(state=self.state, **extras)
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/modifiers/stage.py", line 124, in initialize
    modifier.initialize(state, **kwargs)
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/modifiers/modifier.py", line 118, in initialize
    initialized = self.on_initialize(state=state, **kwargs)
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/modifiers/smoothquant/base.py", line 127, in on_initialize
    self.resolved_mappings_ = self._resolve_mappings(state.model)
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/modifiers/smoothquant/base.py", line 184, in _resolve_mappings
    _, balance_layer = get_matching_layer(
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/utils/pytorch/module.py", line 311, in get_matching_layer
    potential_matches = get_layers(target, module)
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/utils/pytorch/module.py", line 166, in get_layers
    return match_layers_params(targets, module)
  File "/home/bml/.local/lib/python3.9/site-packages/llmcompressor/utils/pytorch/module.py", line 160, in match_layers_params
    raise ValueError(f"Could not find targets {missed} in module {module}")
ValueError: Could not find targets ['re:.*q_proj'] in module BloomForCausalLM
kylesayrs commented 2 weeks ago

Hi @moonlightian!

The issue you are encountering is caused by the default mappings used by SmoothQuantModifier, which target LLaMA-style module names such as q_proj. BLOOM uses different module names (e.g. query_key_value), so you'll need to provide custom mappings specific to the BLOOM architecture.
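If you're ever unsure which modules to map for an architecture, listing the model's submodules is a quick way to find the right names (an inspection sketch, separate from the recipe below):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
# print the attention/MLP projections and the layernorms they follow
for name, module in model.named_modules():
    if any(key in name for key in ("layernorm", "query_key_value", "dense_h_to_4h")):
        print(name, type(module).__name__)

With those names, the recipe becomes: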

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# modify recipe as needed
recipe = """
quant_stage:
    quant_modifiers:
        SmoothQuantModifier:
            smoothing_strength: 0.5
            mappings: [
                [["re:.*query_key_value"], "re:.*input_layernorm"],
                [["re:.*dense_h_to_4h"], "re:.*post_attention_layernorm"]
            ]
            ignore: ["model.decoder.final_layer_norm"]
        GPTQModifier:
            dampening_frac: 0.1
            scheme: "W8A8"
            targets: ["Linear"]
            ignore: ["lm_head"]
"""

# remember to substitute your own model and dataset; bloom-560m is used here
# as a quick test, and the same mappings apply to bloomz-7b1
oneshot(
    model="bigscience/bloom-560m",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="outdir",
    max_seq_length=64,
    num_calibration_samples=64,
)
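
Once the one-shot run finishes, the compressed model written to outdir should load directly in vLLM. A quick smoke test (assuming vLLM is installed):

from vllm import LLM, SamplingParams

# load the W8A8 checkpoint produced above and run a short generation
llm = LLM(model="outdir")
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)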