
Auto-Infer `mappings` Argument for `SmoothQuantModifier` Based on Model Architecture #119

Open rahul-tuli opened 2 weeks ago

rahul-tuli commented 2 weeks ago

Description:

This PR adds automatic inference of the `mappings` argument for `SmoothQuantModifier` based on the model architecture, eliminating the need to specify layer mappings manually.

Before:

Previously, users had to define the layer mappings manually, as in the recipe below. Each mapping pairs the layers to be balanced (e.g., the q/k/v projections) with the preceding norm layer whose output activations are smoothed:

```yaml
quantization_stage:
  quantization_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.5
      mappings: [
        [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
        [["re:.*gate"], "re:.*post_attention_layernorm"]
      ]
      ignore: ["lm_head"]
```

Now:

With this update, `SmoothQuantModifier` infers `mappings` from the model architecture automatically, simplifying the recipe:

```yaml
quantization_stage:
  quantization_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.5
      ignore: ["lm_head"]
```
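
For completeness, here is a minimal sketch of applying the same recipe from Python, assuming the `oneshot` entrypoint and the `open_platypus` calibration dataset (exact import paths and argument names may vary across versions):

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# No `mappings` argument: with this change, the modifier resolves
# the layer mappings from the loaded model's architecture.
recipe = SmoothQuantModifier(smoothing_strength=0.5, ignore=["lm_head"])

oneshot(
    model="Isotonic/TinyMixtral-4x248M-MoE",  # model used for testing below
    dataset="open_platypus",                  # assumed calibration dataset
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=64,
    output_dir="./tinymixtral-smoothquant",
)
```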

Key Changes:

- `mappings` is now an optional argument on `SmoothQuantModifier`; when it is omitted, default mappings are resolved from the model architecture (a sketch of the general approach follows).
- Recipes that specify `mappings` explicitly continue to work unchanged.
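
The PR's actual resolution logic lives in the repository; the following is only an illustrative sketch of the general approach, using a hypothetical registry keyed by the model's class name (the registry and function names here are assumptions, not the PR's code):

```python
# Hypothetical sketch: resolving default SmoothQuant mappings by architecture.
# The Mixtral entry mirrors the mappings shown in the "Before" recipe above.
DEFAULT_MAPPINGS = {
    "LlamaForCausalLM": [
        (["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"),
        (["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"),
    ],
    "MixtralForCausalLM": [
        (["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"),
        (["re:.*gate"], "re:.*post_attention_layernorm"),
    ],
}

def infer_mappings(model) -> list:
    """Return default SmoothQuant mappings for the model's architecture."""
    arch = model.__class__.__name__
    if arch not in DEFAULT_MAPPINGS:
        raise ValueError(
            f"No default SmoothQuant mappings for {arch}; "
            "pass `mappings` explicitly in the recipe."
        )
    return DEFAULT_MAPPINGS[arch]
```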

Motivation:

These changes improve usability by automating configuration and reducing user overhead, as outlined in the design document: Link to Design Doc. They also make quantization recipes adaptable to different model architectures without manual intervention.

The auto-inference of mappings was tested using a Mixtral model: Isotonic/TinyMixtral-4x248M-MoE
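
For context, architecture-based inference keys off the model class, which for this checkpoint resolves as follows (standard transformers usage, not test code from the PR):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Isotonic/TinyMixtral-4x248M-MoE")
print(model.__class__.__name__)  # "MixtralForCausalLM"
```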