vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM

Tensor mask error during Int4 weight quantization of a 2:4 sparse model #67

Closed yzlnew closed 2 months ago

yzlnew commented 3 months ago

Describe the bug

Following the example at https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_24_sparse_w4a16 with the script below:

import os
import torch

from llmcompressor.transformers import SparseAutoModelForCausalLM, apply

# define a recipe to handle sparsity, finetuning and quantization
recipe = "2:4_w4a16_recipe.yaml"

# load the model in as bfloat16 to save on memory and compute
model_stub = "Qwen__Qwen1.5-0.5B-Chat"
model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.bfloat16, device_map="auto"
)

# uses LLM Compressor's built-in preprocessing for UltraChat
dataset = "ultrachat-200k"

# save location of quantized model
output_dir = "qwen0_5b_2:4_w4a16_sparse_ft"

# set dataset config parameters
splits = {"calibration": "train_gen[:5%]", "train": "train_gen"}
max_seq_length = 512
num_calibration_samples = 512

# set training parameters for finetuning
num_train_epochs = 0.01
logging_steps = 500
save_steps = 100000
gradient_checkpointing = True  # saves memory during training
learning_rate = 0.0001
bf16 = False  # using full precision for training
lr_scheduler_type = "cosine"
warmup_ratio = 0.1

# this will run the recipe stage by stage:
# oneshot sparsification -> finetuning -> oneshot quantization
apply(
    model=model,
    dataset=dataset,
    recipe=recipe,
    bf16=bf16,
    output_dir=output_dir,
    splits=splits,
    max_seq_length=max_seq_length,
    num_calibration_samples=num_calibration_samples,
    num_train_epochs=num_train_epochs,
    logging_steps=logging_steps,
    save_steps=save_steps,
    gradient_checkpointing=gradient_checkpointing,
    learning_rate=learning_rate,
    lr_scheduler_type=lr_scheduler_type,
    warmup_ratio=warmup_ratio,
)

Recipe

sparsity_stage:
  run_type: oneshot
  sparsity_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      mask_structure: "2:4"
      sequential_update: false
finetuning_stage:
  run_type: train
  finetuning_modifiers:
    ConstantPruningModifier:
      targets: [
        're:.*q_proj.weight',
        're:.*k_proj.weight', 
        're:.*v_proj.weight',
        're:.*o_proj.weight',
        're:.*gate_proj.weight',
        're:.*up_proj.weight',
        're:.*down_proj.weight',
      ]
      start: 0
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      sequential_update: false
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "channel"
          targets: ["Linear"]
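
(For context on the quantization_stage above: GPTQ is asked for symmetric int4 weights with one scale per output channel. A rough numerical sketch of that scheme, ignoring GPTQ's error-compensation step and using one common symmetric convention, looks like this; it is an approximation, not the library's exact rounding logic.)

import torch

def int4_symmetric_per_channel(weight: torch.Tensor):
    # one scale per output channel (row); int4 symmetric range is [-8, 7]
    max_abs = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 7.0
    q = torch.clamp(torch.round(weight / scale), -8, 7)
    return q, scale  # dequantized weights are approximately q * scale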

Error during the final quantization stage:

    compressed_state_dict = compressor.compress(model, state_dict)
  File "/opt/conda/lib/python3.10/site-packages/compressed_tensors/compressors/model_compressor.py", line 234, in compress
    compressed_state_dict = self.quantization_compressor.compress(
  File "/opt/conda/lib/python3.10/site-packages/compressed_tensors/compressors/marlin_24.py", line 149, in compress
    self.validate_sparsity_structure(prefix, value)
  File "/opt/conda/lib/python3.10/site-packages/compressed_tensors/compressors/marlin_24.py", line 99, in validate_sparsity_structure
    if not tensor_follows_mask_structure(weight):
  File "/opt/conda/lib/python3.10/site-packages/compressed_tensors/utils/helpers.py", line 90, in tensor_follows_mask_structure
    raise ValueError()
ValueError

If I remove the finetuning stage, I can generate a model in the marlin-24 format.
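
For reference, the check that raises here verifies that every contiguous group of four weight values contains at least two zeros. A minimal stand-alone sketch of that kind of check (my own approximation, not the compressed-tensors implementation; it assumes the model object from the script above and weights whose size is divisible by four) can show which Linear layers lost the 2:4 pattern after finetuning:

import torch

def follows_2_4(weight: torch.Tensor) -> bool:
    # group values in fours and require at least two zeros per group
    groups = weight.reshape(-1, 4)
    zeros_per_group = (groups == 0).sum(dim=1)
    return bool((zeros_per_group >= 2).all())

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear) and "lm_head" not in name:
        if not follows_2_4(module.weight.detach()):
            print(f"{name} no longer follows the 2:4 mask")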

horheynm commented 2 months ago

Hi @yzlnew

I ran your script on main and it ran fine, including the finetuning_stage.

Couple questions:

  1. Is "Qwen__Qwen1.5-0.5B-Chat" the same as Qwen/Qwen1.5-0.5B-Chat?
  2. Can you show me how you installed llm-compressor? This is to make sure package versions are consistent.

Thanks

yzlnew commented 2 months ago

@horheynm 1. Yes, I downloaded the model locally. 2. I installed llm-compressor from source, but not the latest from master. Maybe I should try the latest code?

robertgshaw2-neuralmagic commented 2 months ago

> @horheynm 1. Yes, I downloaded the model locally. 2. I installed llm-compressor from source, but not the latest from master. Maybe I should try the latest code?

Can you try again with the release? pip install llmcompressor
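
To confirm which build ends up installed, a quick generic check (standard library only, not an llm-compressor API) is:

from importlib.metadata import version

# prints the installed release of the llmcompressor distribution
print(version("llmcompressor"))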

yzlnew commented 2 months ago

I'm running behind a proxy, so I had to modify the dataset loading logic to get it running. But I can confirm that the latest master is able to perform all three stages, with the first two stored in dense format.
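
(For anyone else behind a proxy: one common alternative to patching the dataset loading code is to export the standard proxy variables before any download is triggered; the address below is only a placeholder.)

import os

# placeholder proxy address; both variables are respected by the Hugging Face download stack
os.environ["HTTP_PROXY"] = "http://proxy.example.com:8080"
os.environ["HTTPS_PROXY"] = "http://proxy.example.com:8080"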