neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0

[MOE Quantization] Warn against "undercalibrated" modules #2262

Open dbogunowicz opened 6 months ago

dbogunowicz commented 6 months ago

Note: this branch requires the following PR to land in compressed-tensors first: https://github.com/neuralmagic/compressed-tensors/pull/46

Example Use:

from sparseml.transformers import SparseAutoModelForCausalLM, SparseAutoTokenizer, oneshot
import os
import torch

model_name = "Isotonic/TinyMixtral-4x248M-MoE"

model = SparseAutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda:0",
    torch_dtype=torch.float16,
)
tokenizer = SparseAutoTokenizer.from_pretrained(
    model_name
)

dataset = "open-platypus"
recipe = "tests/sparseml/transformers/compression/recipes/new_quant_full.yaml"

oneshot(
    model=model,
    dataset=dataset,
    overwrite_output_dir=True,
    output_dir="./output_one_shot",
    recipe=recipe,
    num_calibration_samples=4,
    pad_to_max_length=False,
    min_tokens_per_group=0.3,
)
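Running this produces the console output below. With min_tokens_per_group=0.3 and a calibration batch of 970 tokens, any expert linear layer that receives fewer than 0.3 × 970 = 291 tokens gets flagged: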
2024-05-15 12:15:13 sparseml.transformers.finetune.runner INFO     *** One Shot ***
2024-05-15 12:15:14 sparseml.core.recipe.recipe INFO     Loading recipe from file tests/sparseml/transformers/compression/recipes/new_quant_full.yaml
/root/sparseml/.venv/lib/python3.10/site-packages/pydantic/_internal/_fields.py:151: UserWarning: Field "model_fuse_fn_name" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
/root/sparseml/.venv/lib/python3.10/site-packages/pydantic/_internal/_fields.py:151: UserWarning: Field "model_fuse_fn_kwargs" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
2024-05-15 12:15:14 sparseml.modifiers.quantization_vllm.pytorch INFO     Running vLLMQuantizationModifier calibration with 4 samples...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  4.19it/s]
2024-05-15 12:15:15 sparseml.modifiers.quantization_vllm.pytorch WARNING  The module_name: model.layers.0.block_sparse_moe.experts.1.w1 received less than 30% of calibration batch tokens (212/970 tokens). This may harm the quantization quality.
2024-05-15 12:15:15 sparseml.modifiers.quantization_vllm.pytorch WARNING  The module_name: model.layers.0.block_sparse_moe.experts.1.w2 received less than 30% of calibration batch tokens (212/970 tokens). This may harm the quantization quality.
...
2024-05-15 12:15:15 sparseml.modifiers.quantization_vllm.pytorch WARNING  The module_name: model.layers.10.block_sparse_moe.experts.3.w3 received less than 30% of calibration batch tokens (233/970 tokens). This may harm the quantization quality.
2024-05-15 12:15:15 sparseml.modifiers.quantization_vllm.pytorch WARNING  The module_name: model.layers.11.block_sparse_moe.experts.2.w1 received less than 30% of calibration batch tokens (21/970 tokens). This may harm the quantization quality.
2024-05-15 12:15:15 sparseml.modifiers.quantization_vllm.pytorch WARNING  The module_name: model.layers.11.block_sparse_moe.experts.2.w2 received less than 30% of calibration batch tokens (21/970 tokens). This may harm the quantization quality.
2024-05-15 12:15:15 sparseml.modifiers.quantization_vllm.pytorch WARNING  The module_name: model.layers.11.block_sparse_moe.experts.2.w3 received less than 30% of calibration batch tokens (21/970 tokens). This may harm the quantization quality.
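
The check behind these warnings compares the number of tokens routed to each expert against min_tokens_per_group of the total calibration tokens. As a rough illustration of that bookkeeping (not the actual sparseml implementation; the helper names here are hypothetical), the same counting can be done with PyTorch forward pre-hooks on the expert linear layers:

import torch
from collections import defaultdict

def attach_token_counters(model, expert_keyword="experts"):
    # Hypothetical helper: count how many tokens reach each expert
    # linear layer (e.g. Mixtral's w1/w2/w3) during calibration.
    counts = defaultdict(int)
    handles = []

    def make_hook(name):
        def hook(module, args):
            # args[0] is the (num_routed_tokens, hidden_dim) input
            counts[name] += args[0].shape[0]
        return hook

    for name, module in model.named_modules():
        if expert_keyword in name and isinstance(module, torch.nn.Linear):
            handles.append(module.register_forward_pre_hook(make_hook(name)))
    return counts, handles

def warn_undercalibrated(counts, total_tokens, min_tokens_per_group=0.3):
    # Flag any expert module that saw fewer than the requested
    # fraction of the calibration batch tokens.
    threshold = min_tokens_per_group * total_tokens
    for name, seen in sorted(counts.items()):
        if seen < threshold:
            print(f"WARNING: {name} received less than "
                  f"{min_tokens_per_group:.0%} of calibration batch tokens "
                  f"({seen}/{total_tokens}). This may harm quantization quality.")

After the calibration forward passes, the hooks can be removed with handle.remove() and warn_undercalibrated(counts, total_tokens) called with the total token count of the calibration batch, reproducing warnings of the shape shown in the logs above.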