vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM

Tensor mask error during Int4 weight quantization of a 2:4 sparse model #67

Closed yzlnew closed 2 months ago

yzlnew commented 3 months ago

Describe the bug

Following the example at https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_24_sparse_w4a16 with the script below:

import os
import torch

from llmcompressor.transformers import SparseAutoModelForCausalLM, apply

# define a recipe to handle sparsity, finetuning and quantization
recipe = "2:4_w4a16_recipe.yaml"

# load the model in as bfloat16 to save on memory and compute
model_stub = "Qwen__Qwen1.5-0.5B-Chat"
model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.bfloat16, device_map="auto"
)

# uses LLM Compressor's built-in preprocessing for UltraChat
dataset = "ultrachat-200k"

# save location of quantized model
output_dir = "qwen0_5b_2:4_w4a16_sparse_ft"

# set dataset config parameters
splits = {"calibration": "train_gen[:5%]", "train": "train_gen"}
max_seq_length = 512
num_calibration_samples = 512

# set training parameters for finetuning
num_train_epochs = 0.01
logging_steps = 500
save_steps = 100000
gradient_checkpointing = True  # saves memory during training
learning_rate = 0.0001
bf16 = False  # using full precision for training
lr_scheduler_type = "cosine"
warmup_ratio = 0.1

# this will run the recipe stage by stage:
# oneshot sparsification -> finetuning -> oneshot quantization
apply(
    model=model,
    dataset=dataset,
    recipe=recipe,
    bf16=bf16,
    output_dir=output_dir,
    splits=splits,
    max_seq_length=max_seq_length,
    num_calibration_samples=num_calibration_samples,
    num_train_epochs=num_train_epochs,
    logging_steps=logging_steps,
    save_steps=save_steps,
    gradient_checkpointing=gradient_checkpointing,
    learning_rate=learning_rate,
    lr_scheduler_type=lr_scheduler_type,
    warmup_ratio=warmup_ratio,
)

Recipe

sparsity_stage:
  run_type: oneshot
  sparsity_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      mask_structure: "2:4"
      sequential_update: false
finetuning_stage:
  run_type: train
  finetuning_modifiers:
    ConstantPruningModifier:
      targets: [
        're:.*q_proj.weight',
        're:.*k_proj.weight', 
        're:.*v_proj.weight',
        're:.*o_proj.weight',
        're:.*gate_proj.weight',
        're:.*up_proj.weight',
        're:.*down_proj.weight',
      ]
      start: 0
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      sequential_update: false
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "channel"
          targets: ["Linear"]
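
(For context on the quantization_stage above: GPTQ is asked for symmetric int4 weights with one scale per output channel. A rough numerical sketch of that scheme, ignoring GPTQ's error-compensation step and using one common symmetric convention, looks like this; it is an approximation, not the library's exact rounding logic.)

import torch

def int4_symmetric_per_channel(weight: torch.Tensor):
    # one scale per output channel (row); int4 symmetric range is [-8, 7]
    max_abs = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 7.0
    q = torch.clamp(torch.round(weight / scale), -8, 7)
    return q, scale  # dequantized weights are approximately q * scale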

Error during the final quantization stage:

    compressed_state_dict = compressor.compress(model, state_dict)
  File "/opt/conda/lib/python3.10/site-packages/compressed_tensors/compressors/model_compressor.py", line 234, in compress
    compressed_state_dict = self.quantization_compressor.compress(
  File "/opt/conda/lib/python3.10/site-packages/compressed_tensors/compressors/marlin_24.py", line 149, in compress
    self.validate_sparsity_structure(prefix, value)
  File "/opt/conda/lib/python3.10/site-packages/compressed_tensors/compressors/marlin_24.py", line 99, in validate_sparsity_structure
    if not tensor_follows_mask_structure(weight):
  File "/opt/conda/lib/python3.10/site-packages/compressed_tensors/utils/helpers.py", line 90, in tensor_follows_mask_structure
    raise ValueError()
ValueError

If I remove the finetuning stage, I can generate a model in the marlin-24 format.
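
For reference, the check that raises here verifies that every contiguous group of four weight values contains at least two zeros. A minimal stand-alone sketch of that kind of check (my own approximation, not the compressed-tensors implementation; it assumes the model object from the script above and weights whose size is divisible by four) can show which Linear layers lost the 2:4 pattern after finetuning:

import torch

def follows_2_4(weight: torch.Tensor) -> bool:
    # group values in fours and require at least two zeros per group
    groups = weight.reshape(-1, 4)
    zeros_per_group = (groups == 0).sum(dim=1)
    return bool((zeros_per_group >= 2).all())

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear) and "lm_head" not in name:
        if not follows_2_4(module.weight.detach()):
            print(f"{name} no longer follows the 2:4 mask")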

horheynm commented 2 months ago

Hi @yzlnew

I ran your script on main and it ran fine, including the finetuning_stage.

Couple questions:

  1. Is "Qwen__Qwen1.5-0.5B-Chat" the same as Qwen/Qwen1.5-0.5B-Chat?
  2. Can you show me how you installed llm-compressor? This is to make sure package versions are consistent.

Thanks

yzlnew commented 2 months ago

@horheynm 1. Yes, I downloaded the model locally. 2. I installed llm-compressor from source, but not the latest from master. Maybe I should try the latest code?

robertgshaw2-neuralmagic commented 2 months ago

> @horheynm 1. Yes, I downloaded the model locally. 2. I installed llm-compressor from source, but not the latest from master. Maybe I should try the latest code?

Can you try again with the release? pip install llmcompressor
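
To confirm which build ends up installed, a quick generic check (standard library only, not an llm-compressor API) is:

from importlib.metadata import version

# prints the installed release of the llmcompressor distribution
print(version("llmcompressor"))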

yzlnew commented 2 months ago

I'm running behind a proxy, so I had to modify the dataset loading logic to get it running. But I can confirm that the latest master is able to perform all three stages, with the first two stored in dense format.
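
(For anyone else behind a proxy: one common alternative to patching the dataset loading code is to export the standard proxy variables before any download is triggered; the address below is only a placeholder.)

import os

# placeholder proxy address; both variables are respected by the Hugging Face download stack
os.environ["HTTP_PROXY"] = "http://proxy.example.com:8080"
os.environ["HTTPS_PROXY"] = "http://proxy.example.com:8080"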