intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

INT6/INT4 support for model optimization #1100

Closed JiaojiaoYe1994 closed 10 months ago

JiaojiaoYe1994 commented 1 year ago

Dear Author,

Thank you so much for this project. I have read your results and have a question about the PTQ implementation: does it support INT4/INT6 or other precisions?

hshen14 commented 1 year ago

Yes, INT4/INT6/INT8 are supported in the context of weight-only post-training quantization.

JiaojiaoYe1994 commented 1 year ago

> Yes, INT4/INT6/INT8 are supported in the context of weight-only post-training quantization.

I see. For example, if I use post-training static quantization, i.e. quantizing activations and weights at the same time, can I apply INT4/INT6/INT8 quantization?

hshen14 commented 1 year ago

> > Yes, INT4/INT6/INT8 are supported in the context of weight-only post-training quantization.
>
> I see. For example, if I use post-training static quantization, i.e. quantizing activations and weights at the same time, can I apply INT4/INT6/INT8 quantization?

INT8 is supported for both activations and weights, while INT4/INT6 are weight-only.
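For concreteness, here is a minimal sketch of the two configurations being contrasted, using the PostTrainingQuantConfig API that appears later in this thread (the exact keys should be checked against the docs):

from neural_compressor import PostTrainingQuantConfig

# Post-training static quantization: activations and weights both go to INT8.
static_conf = PostTrainingQuantConfig(approach="static")

# Weight-only quantization: weights can go below 8 bit (e.g. INT4),
# activations stay in floating point.
weight_only_conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # matched with re.match against supported op types
            "weight": {
                "bits": 4,           # 4-bit weights
                "group_size": -1,    # -1 = per-channel
                "scheme": "sym",
                "algorithm": "RTN",  # round-to-nearest
            },
        },
    },
)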

paul-ang commented 1 year ago

What is the appropriate workflow to quantize the weights to INT4 and the activations to INT8? Are we able to achieve this in one .fit() session?

hshen14 commented 1 year ago

> What is the appropriate workflow to quantize the weights to INT4 and the activations to INT8? Are we able to achieve this in one .fit() session?

The quantization flow is quite similar (still under fit), but an additional config is required. See the example: https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md.
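For reference, a minimal sketch of that flow; the config mirrors the weight-only example in the linked doc, and the model/dataloader names are placeholders:

from neural_compressor import PostTrainingQuantConfig
from neural_compressor.quantization import fit

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {
            "weight": {"bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "RTN"},
        },
    },
)
# RTN rounds weights directly, so no calibration data is strictly required;
# other weight-only algorithms (e.g. GPTQ/AWQ) need a calib_dataloader.
q_model = fit(model=float_model, conf=conf)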

paul-ang commented 12 months ago

I tried that example, but unfortunately it doesn't work. I tried the following two configurations:

Attempt 1

qt_conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # re.match
            "weight": {
                "bits": 4,  # 1-8 bit
                "group_size": -1,  # -1 (per-channel)
                "scheme": "sym",
                "algorithm": "RTN",
            },
        },
    },
)

This raised a KeyError: 'default' exception. Error trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/neural_compressor/quantization.py", line 223, in fit
    strategy.traverse()
  File "/usr/local/lib/python3.9/dist-packages/neural_compressor/strategy/auto.py", line 134, in traverse
    super().traverse()
  File "/usr/local/lib/python3.9/dist-packages/neural_compressor/strategy/strategy.py", line 482, in traverse
    self._prepare_tuning()
  File "/usr/local/lib/python3.9/dist-packages/neural_compressor/strategy/strategy.py", line 378, in _prepare_tuning
    self.capability = self.capability or self.adaptor.query_fw_capability(self.model)
  File "/usr/local/lib/python3.9/dist-packages/neural_compressor/utils/utility.py", line 301, in fi
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/neural_compressor/adaptor/pytorch.py", line 4834, in query_fw_capability
    return self._get_quantizable_ops(model.model)
  File "/usr/local/lib/python3.9/dist-packages/neural_compressor/adaptor/pytorch.py", line 1141, in _get_quantizable_ops
    else copy.deepcopy(capability["default"])
KeyError: 'default'

Attempt 2

    qt_conf = PostTrainingQuantConfig(
        domain="object_detection",
        excluded_precisions=["bf16", "fp16"],
        approach="auto",
        quant_level=1,
        op_type_dict={
            "Conv": {
                "weight": {
                    "bits": 4,
                    "group_size": -1,
                    "scheme": "sym",
                    "algorithm": "RTN",
                }
            }
        },
    )

This ran successfully, but I don't think the weights were quantized to 4 bit. There was no accuracy loss, and I also manually inspected the model.pt weights file.

I am using torch==2.0.1.
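As a side note, one way to sanity-check whether the 4-bit rounding actually happened (a sketch, assuming the weight-only path returns fake-quantized fp32 weights, so each quantization group should collapse to at most 2**bits distinct values):

import torch

state_dict = torch.load("model.pt", map_location="cpu")  # adjust if the file stores a full model
for name, tensor in state_dict.items():
    if isinstance(tensor, torch.Tensor) and name.endswith("weight") and tensor.dim() == 2:
        # per-channel (group_size=-1): count distinct values in each output channel
        worst = max(torch.unique(row).numel() for row in tensor)
        print(f"{name}: up to {worst} distinct values per channel")
        # expect <= 16 for 4-bit symmetric weights; hundreds or more means still fp32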

xin3he commented 12 months ago

@paul-ang Only Linear is supported in the weight-only approach. The issue you hit in Attempt 1 is caused by a configuration mismatch for Conv2d; I will remove Conv2d when fetching quantizable ops.
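To illustrate what a Linear-only weight-only config looks like (an assumption on my part, not a confirmed workaround; whether it avoids the KeyError above before the fix lands depends on how quantizable ops are collected):

qt_conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        "Linear": {  # scope 4-bit RTN to Linear only; Conv2d is left out of the weight-only spec
            "weight": {
                "bits": 4,
                "group_size": -1,
                "scheme": "sym",
                "algorithm": "RTN",
            },
        },
    },
)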

paul-ang commented 12 months ago

Does this mean that quantizing Conv2D lower than 8 bit is not supported at the moment?

xin3he commented 12 months ago

> Does this mean that quantizing Conv2D lower than 8 bit is not supported at the moment?

Yes. Conv2d is usually not a memory-bound operator, so we only support Linear for large language models. If you can point to a model that uses Conv2d with a large weight size, we will consider supporting it in weight-only mode.

xin3he commented 12 months ago

@paul-ang Thank you for reporting the Conv2d issue. We have raised a PR to fix it.