neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0

Fix GPTQ Aliases #2327

Closed · Satrat closed 5 months ago

Satrat commented 5 months ago

https://github.com/neuralmagic/compressed-tensors/pull/81 must be merged first

When specifying a scheme preset, the quantization modifier for GPTQ was not being properly initialized. In the example code below, despite specifying a W4A16 scheme, the quantization config was always empty:

Building quantization modifier with args: {'config_groups': {'config_group_0': QuantizationScheme(targets=['Linear'], weights=None, input_activations=None, output_activations=None)}}

The fix was to update the GPTQ modifier initialization to correctly apply the preset scheme. I've also added unit tests to confirm that all variants of the GPTQ recipe function as intended.
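For context, once the preset is applied the config group should be populated rather than empty. Below is a rough sketch of what the W4A16 preset is expected to expand to, written with the compressed-tensors classes that appear in the log output above; the specific argument values (integer type, symmetry, group strategy, group size) are assumptions for illustration, not the exact preset definition.

from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme

# Approximate expansion of the W4A16 preset for Linear layers.
# The argument values below are assumptions for illustration only.
w4a16 = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=4,          # W4: 4-bit weights
        type="int",
        symmetric=True,
        strategy="group",
        group_size=128,
    ),
    input_activations=None,  # A16: activations stay in 16-bit
    output_activations=None,
)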

Example Code

import torch
from datasets import load_dataset
from sparseml.transformers import SparseAutoModelForCausalLM, oneshot
from sparseml.modifiers.quantization.gptq import GPTQModifier
from transformers import AutoTokenizer

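# Calibration settings: a small sample count and sequence length keep the demo quick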
NUM_CALIBRATION_SAMPLES = 16
MAX_SEQ_LEN = 2048
MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

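# Load the base model in bfloat16 along with its tokenizer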
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

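# GPTQ modifier configured via the W4A16 preset scheme applied to all Linear layers.
# Before this fix, the preset was ignored and the resulting quantization config was empty.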
gptq = GPTQModifier(
    scheme={"W4A16": ["Linear"]}
)

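# Build a small calibration set and render each conversation with the chat template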
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(lambda batch: {"text": tokenizer.apply_chat_template(batch["messages"], tokenize=False)})

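# Run one-shot GPTQ quantization over the calibration samples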
oneshot(
    model=model,
    dataset=ds,
    recipe=gptq,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
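
After oneshot completes, the quantized model can be saved for later use. A minimal sketch follows, assuming save_pretrained accepts a save_compressed flag in this version and using a hypothetical output directory name:

SAVE_DIR = "TinyLlama-1.1B-Chat-v1.0-W4A16"  # hypothetical output directory
model.save_pretrained(SAVE_DIR, save_compressed=True)  # save_compressed assumed supported
tokenizer.save_pretrained(SAVE_DIR)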