vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Apache License 2.0

GPTQ Algorithm Cleanup #120

Closed kylesayrs closed 2 months ago

kylesayrs commented 2 months ago

Purpose

  1. Clean up implementation for easier reading (comments, better structure)
  2. Allow the algorithm to be skipped if the layer is not being targeted (see the sketch after this list)
  3. Fix bug where layer is not frozen after QuantizationModifier
  4. Prevent weight observer misuse
  5. Deprecate weight_fake_quant use case
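As a rough illustration of item 2, the skip decision can hinge on whether QuantizationModifier attached a quantization scheme to the module. This is a minimal sketch only; the function names and debug message are placeholders, not the code in this PR.

```python
# Minimal sketch of the "skip untargeted layers" idea (item 2); function names
# and the debug message are placeholders, not the PR's actual implementation.
import logging

import torch

logger = logging.getLogger(__name__)


def should_compress(module: torch.nn.Module) -> bool:
    # QuantizationModifier attaches a `quantization_scheme` to targeted modules;
    # ignored layers (e.g. lm_head) never receive one, so GPTQ can pass over them.
    scheme = getattr(module, "quantization_scheme", None)
    return scheme is not None and getattr(scheme, "weights", None) is not None


def compress_module(name: str, module: torch.nn.Module) -> None:
    if not should_compress(module):
        logger.debug("Skipping unquantized layer %s", name)  # placeholder log text
        return
    # ... run the GPTQ weight update for this module ...
```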

Changes

Testing

Regression tested saving, loading, and vLLM inference with a group-quantized model
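For reference, that round trip roughly follows the project's standard oneshot flow; the model, dataset, scheme, and save directory below are placeholders rather than the exact regression setup.

```python
# Hedged sketch of the save -> load -> vLLM inference round trip; the model,
# dataset, scheme, and save directory are placeholders, not the exact settings
# used for this regression test.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from vllm import LLM

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
SAVE_DIR = "TinyLlama-1.1B-W4A16"

# Apply GPTQ with a group-quantized weight scheme and save the compressed model
oneshot(
    model=MODEL_ID,
    dataset="open_platypus",
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    output_dir=SAVE_DIR,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Reload the compressed checkpoint with vLLM and run a quick generation
llm = LLM(model=SAVE_DIR)
print(llm.generate("Compression makes models")[0].outputs[0].text)
```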

kylesayrs commented 2 months ago

@Satrat Can you specify what you're looking for in a skip test?

Satrat commented 2 months ago

> @Satrat Can you specify what you're looking for in a skip test?

You could just initialize a model with some modules skipped (more than just the lm_head) and others quantized, then search the logs for the debug string. Alternatively, testing your getattr_chain helper function directly on the model would be fine too.
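For what it's worth, the second option might look roughly like the sketch below; the import path and the (obj, dotted_path, default) signature of getattr_chain are assumptions on my part, not taken from this PR.

```python
# Rough sketch of testing getattr_chain directly; the import path and the
# (obj, dotted_path, default) signature are assumptions, not the PR's API.
from types import SimpleNamespace

from llmcompressor.utils import getattr_chain  # assumed import path


def test_getattr_chain_distinguishes_skipped_modules():
    # Stand-ins for a quantized linear layer and an ignored lm_head
    quantized = SimpleNamespace(
        quantization_scheme=SimpleNamespace(weights={"num_bits": 4})
    )
    ignored = SimpleNamespace()  # no quantization_scheme attached

    # Targeted module: the dotted chain resolves to its weight quantization args
    assert getattr_chain(quantized, "quantization_scheme.weights", None) is not None

    # Ignored module: the chain breaks, so the default signals "skip this layer"
    assert getattr_chain(ignored, "quantization_scheme.weights", None) is None
```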

kylesayrs commented 2 months ago

Yeah, the failing base test is caused by a bug from the previous release, which I fixed on the main branch. See: https://github.com/neuralmagic/compressed-tensors/blame/4b214e582c8434733efea79239cfadec9358b7fb/src/compressed_tensors/quantization/observers/base.py#L165-L167

kylesayrs commented 2 months ago

Using my local machine and the main branch of compressed_tensors, I confirmed that tests/llmcompressor/modifiers/ and tests/llmcompressor/transformers/compression/ are passing.