intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

Cannot quantize model in per-tensor way #82

Closed: zihaomu closed this issue 2 years ago

zihaomu commented 2 years ago

Hello team,

I am trying to quantize all the parameters of my model in a per_tensor way, but I found that the final quantized model still contains per_channel layers.

The YAML file is as follows:

version: 1.0

model:                                               # mandatory. used to specify model specific information.
  name: mobilenetv2
  framework: onnxrt_qlinearops                       # mandatory. supported values are tensorflow, pytorch, pytorch_ipex, onnxrt_integer, onnxrt_qlinear or mxnet; allow new framework backend extension.

quantization:                                        # optional. tuning constraints on model-wise for advanced users to reduce tuning space.
  approach: post_training_static_quant               # optional. default value is post_training_static_quant.
  calibration:
    dataloader:
      batch_size: 1
      dataset:
        ImagenetRaw:
          data_path: /home/tau/Workspace/databank/imagenet/ILSVRC/Data/CLS-LOC/val
          image_list: /home/tau/Workspace/databank/imagenet/caffe_labels/val.txt      # download from http://dl.caffe.berkeleyvision.org/caffe_ilsvrc12.tar.gz
      transform:
        Rescale: {}
        Resize:
          size: 256
        CenterCrop:
          size: 224
        Normalize:
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
        Transpose:
          perm: [2, 0, 1]
        Cast:
          dtype: float32
  model_wise:                                        # optional. tuning constraints on model-wise for advanced users to reduce tuning space.
    weight:
      granularity: per_tensor
      scheme: asym
      dtype: int8
      algorithm: minmax
    activation:
      granularity: per_tensor
      scheme: asym
      algorithm: minmax

tuning:
  accuracy_criterion:
    relative:  0.02                                  # optional. default criterion is relative, the other option is absolute. this example allows relative accuracy loss: 2%.
  exit_policy:
    timeout: 0                                       # optional. tuning timeout (seconds). default value is 0 which means early stop. combine with max_trials field to decide when to exit.
  random_seed: 9527                                  # optional. random seed for deterministic tuning.

Thanks.

mengniwang95 commented 2 years ago

Hi, in order to reduce the tuning space, and because per_channel usually gives better accuracy, we only support per_channel in the framework yaml https://github.com/intel/neural-compressor/blob/master/neural_compressor/adaptor/onnxrt_qlinear.yaml. You can update 'per_channel' to 'per_tensor' there for your specific ORT version.
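
The entries to change look roughly like the sketch below (an illustrative approximation, not a verbatim copy of onnxrt_qlinear.yaml; the real file groups these per-op capabilities under ONNX Runtime version keys, so edit the section matching your installed onnxruntime):

Conv:                                                # per-op capability entry (structure approximated)
  weight:
    dtype: ['int8']
    scheme: ['sym']
    granularity: ['per_channel']                     # change to ['per_tensor'] for the workaround above
    algorithm: ['minmax']
  activation:
    dtype: ['uint8']
    scheme: ['asym']
    granularity: ['per_tensor']
    algorithm: ['minmax']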

zihaomu commented 2 years ago

Hi, @mengniwang95. Thanks for your quick reply. Is there an option to directly quantize the entire model as per_tensor?

mengniwang95 commented 2 years ago

@zihaomu Unfortunately there is no other way. We will consider adding 'per_tensor' to the framework yaml in the next release.

zihaomu commented 2 years ago

Thanks. Looking forward to the next release.

zihaomu commented 2 years ago

Hi @mengniwang95, are there any plans to release an API for per-tensor quantization of the entire model in the near future?

mengniwang95 commented 2 years ago

Hi, version 1.12 supports the per-tensor way. If you want to get a per-tensor quantized model directly, please add model_wise in the yaml file as in https://github.com/intel/neural-compressor/blob/aac0a0ec860d6d875467a8b7fb119ec18713fd48/neural_compressor/template/ptq.yaml#L43 and set 'granularity' to per_tensor.
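
For reference, the relevant block (matching the model_wise section in the yaml posted above) would look something like this; only the granularity fields need to be per_tensor:

quantization:
  approach: post_training_static_quant
  model_wise:                                        # model-wise constraints applied to every quantizable op
    weight:
      granularity: per_tensor
    activation:
      granularity: per_tensor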

zihaomu commented 2 years ago


Thanks @mengniwang95, this will be of great help to us.