vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Apache License 2.0

Layers not skipped with ignore=["re:.*"] #91

Open horheynm opened 3 months ago

horheynm commented 3 months ago

Describe the bug
Cosmetic issue: modules matched by the ignore list are still reported as being compressed.

Running the code prints the following to stdout:

===== Compressing layer 23/40  =====
2024-08-15T15:22:59.526464+0000 | compress_module | INFO - Compressing model.layers.22.model.layers.22.self_attn.o_proj...
2024-08-15T15:23:00.110515+0000 | compress | INFO - time 0.51
2024-08-15T15:23:00.110713+0000 | compress | INFO - error 0.00

Expected behavior
Exit the compress() function early for ignored modules - GPTQ will still run over them, since we do need all the layers in the pipeline for the data to flow properly.
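
For illustration only, a minimal sketch of the skip check described above (the function name and signature are hypothetical, not the library's internals; it only assumes the "re:" prefix convention for regex ignore entries that the recipe below already uses):

import re

def should_compress(module_name: str, ignore: list[str]) -> bool:
    """Return False when a module name matches an ignore entry (illustrative only)."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return False  # regex ignore entry matched -> skip compression
        elif pattern == module_name:
            return False      # exact-name ignore entry -> skip compression
    return True

# With ignore=["lm_head", "re:.*"] every module name matches "re:.*", so the
# "Compressing model.layers.22.self_attn.o_proj..." log line should not appear.
print(should_compress("model.layers.22.self_attn.o_proj", ["lm_head", "re:.*"]))  # False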

Environment Include all relevant environment information:

  1. OS [e.g. Ubuntu 20.04]:
  2. Python version [e.g. 3.7]:
  3. LLM Compressor version or commit hash [e.g. 0.1.0, f7245c8]:
  4. ML framework version(s) [e.g. torch 2.3.1]:
  5. Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]:
  6. Other relevant environment information [e.g. hardware, CUDA version]:

To Reproduce

from llmcompressor.modifiers.quantization import GPTQModifier

recipe = [
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head", "re:.*"]),
]

Using the examples/big_models_with_accelerate/multi_gpu_int8.py script.
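
For context, a minimal sketch of how that recipe would be applied end to end; the model id and dataset are placeholders, and the import paths and oneshot arguments follow the general pattern of the repository's examples rather than the exact multi_gpu_int8.py script:

from transformers import AutoModelForCausalLM
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)

recipe = [
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head", "re:.*"]),
]

# Even though "re:.*" matches every Linear module, the GPTQ pass still
# logs "Compressing ..." for each layer - the cosmetic issue reported above.
oneshot(
    model=model,
    dataset="open_platypus",        # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)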

Errors If applicable, add a full print-out of any errors or exceptions that are raised or include screenshots to help explain your problem.

Additional context Add any other context about the problem here. Also include any relevant files.

fengyang95 commented 2 months ago

Maybe you can pass the layers you want to quantize into the sequential_targets.
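
A sketch of that suggestion, assuming GPTQModifier accepts a sequential_targets argument; the decoder-layer class name is a placeholder for the model being compressed:

recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        # Restrict the sequential GPTQ pass to the named layer class;
        # "LlamaDecoderLayer" is a placeholder for your model's decoder block.
        sequential_targets=["LlamaDecoderLayer"],
    ),
]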

markurtz commented 4 weeks ago

@kylesayrs I believe this should be fixed with the work you're doing, can you confirm?