neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0

[GPTQ UX] Add scheme arg with QuantizationScheme support #2286

Closed rahul-tuli closed 1 month ago

rahul-tuli commented 1 month ago

This PR adds support for a scheme arg in GPTQModifier; the arg can be set to a single QuantizationScheme object.

Recipe:

test_stage:
  obcq_modifiers:
    GPTQModifier:
        ignore: ["LlamaRotaryEmbedding", "LlamaRMSNorm", "SiLUActivation", "MatMulLeftInput_QK", "MatMulRightInput_QK", "MatMulLeftInput_PV", "MatMulRightInput_PV", "MatMulOutput_QK", "MatMulOutput_PV", "lm_head", "Embedding"]
        sequential_update: True
        dampening_frac: 0.001
        block_size: 128
        targets: ["Linear"]
        scheme:
          input_activations: null
          output_activations: null
          weights:
              num_bits: 8
              type: "int"
              symmetric: true
              strategy: "tensor"
              group_size: 128
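
For reference, the same scheme can also be constructed programmatically. The sketch below mirrors the YAML above; it assumes the QuantizationScheme/QuantizationArgs classes from compressed-tensors and a particular GPTQModifier import path, both of which may differ in your install:

# Minimal sketch mirroring the recipe's scheme; import paths are assumptions.
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme
from sparseml.modifiers.quantization.gptq import GPTQModifier  # assumed path

scheme = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=8,
        type="int",
        symmetric=True,
        strategy="tensor",
        group_size=128,
    ),
    input_activations=None,
    output_activations=None,
)

modifier = GPTQModifier(
    targets=["Linear"],
    sequential_update=True,
    dampening_frac=0.001,
    block_size=128,
    scheme=scheme,
)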

Test script:

import argparse
from datetime import datetime
from pathlib import Path

from sparseml.transformers import SparseAutoModelForCausalLM, oneshot

tinyllama_stub = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tiny_random_llama_stub = "HuggingFaceH4/tiny-random-LlamaForCausalLM"

parser = argparse.ArgumentParser(description="Get Quant Model")
parser.add_argument('--recipe', default="/root/projects/sparseml/local/feature/recipe.yaml", help='Path to the recipe')
parser.add_argument('--model_stub', default=tinyllama_stub, help='Model stub')
parser.add_argument('--dataset', default="open_platypus", help='Dataset name')
parser.add_argument('--max_seq_length', type=int, default=512, help='Maximum sequence length')
parser.add_argument('--output_dir', default=None, help='Output directory')
parser.add_argument('--num_calibration_samples', type=int, default=512, help='Number of calibration samples')
parser.add_argument('--overwrite_output_dir', action='store_true', help='Overwrite output directory')
parser.add_argument('--small', action='store_true', help='Use a small model')
args = parser.parse_args()

def get_save_dir_name(model_stub):
    """Build a timestamped save directory under ./output for the given model stub."""
    dir_name = f"{model_stub.split('/')[-1]}_{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    return str(Path("output") / dir_name)

recipe = args.recipe
model_stub = tiny_random_llama_stub if args.small else args.model_stub
dataset = args.dataset
max_seq_length = args.max_seq_length
output_dir = args.output_dir or get_save_dir_name(model_stub)
num_calibration_samples = args.num_calibration_samples
device = "cuda"

# Apply the GPTQ recipe in one shot, calibrating on the selected dataset
oneshot(
    model=model_stub,
    dataset=dataset,
    overwrite_output_dir=True,
    output_dir=output_dir,
    max_seq_length=max_seq_length,
    num_calibration_samples=num_calibration_samples,
    recipe=recipe,
    oneshot_device=device,
)

# try reloading the model to verify the saved checkpoint decompresses correctly
model_new = SparseAutoModelForCausalLM.from_pretrained(output_dir)
print("Model reloaded successfully!")

Test command:

python get_quant_model.py --small --recipe ./gptq_ux/recipes/recipe_scheme_quant_scheme.yaml
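
To run the same flow against the full TinyLlama model, drop --small so model_stub falls back to the script's argparse default:

python get_quant_model.py --recipe ./gptq_ux/recipes/recipe_scheme_quant_scheme.yaml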

Output (from the --small run):

Calculating quantization compression ratio: 25it [00:00, 697.59it/s]
2024-05-15 13:26:18 sparseml.pytorch.model_load.helpers INFO     Saving output to /root/projects/sparseml/output/tiny-random-LlamaForCausalLM_2024-05-15-13-26-00
Decompressing model: 0it [00:00, ?it/s]
Model reloaded successfully!
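
As an extra smoke test of the reloaded checkpoint (not part of the script above; this sketch assumes the tokenizer was saved to the output directory alongside the model, which may not hold in every setup):

from transformers import AutoTokenizer

from sparseml.transformers import SparseAutoModelForCausalLM

# Example path taken from the log above
output_dir = "output/tiny-random-LlamaForCausalLM_2024-05-15-13-26-00"

model_new = SparseAutoModelForCausalLM.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)  # assumes tokenizer was saved here

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model_new.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))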