intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Provide quantization for nano #3600

Closed zhentaocc closed 2 years ago

zhentaocc commented 2 years ago

Description

We propose to integrate quantization methods into Nano to reduce model size and accelerate inference. Neural Compressor provides a set of methods to quantize a model and simplifies the usage.
Details of the design are discussed in the comments below.

Related tasks

jason-dai commented 2 years ago
  1. What's needed in the yaml file?
  2. How many parameters does the user need to customize?
  3. Can we use a Python dictionary (instead of a yaml file) for the user to specify customized parameters (on top of the default parameters)?
  4. Please describe calib_dataloader, val_dataloaders and datamodule
  5. Please also look at https://github.com/openvinotoolkit/nncf; preferably we should have an abstraction layer on top of INC and NNCF (for both PyTorch and TF).
zhentaocc commented 2 years ago

For 3

Yes, we can do that, and we must document all the keys for users.
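For illustration, a rough sketch (the merge helper below is hypothetical, not part of the proposal) of how a small user dictionary could be applied on top of the default parameters:

default_conf = {'tuning': {'strategy': {'name': 'basic'},
                           'accuracy_criterion': {'relative': 0.01}}}
user_conf = {'tuning': {'accuracy_criterion': {'relative': 0.1}}}   # only the keys the user wants to override


def merge(default, override):
    # recursively overlay the user's keys on top of the default config
    merged = dict(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


conf = merge(default_conf, user_conf)
# conf['tuning'] == {'strategy': {'name': 'basic'}, 'accuracy_criterion': {'relative': 0.1}}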

For 1,2

Shown below is a template config for running INC on Nano. I removed a large portion of the original INC template, since those keys are handled by Nano and users don't need to assign them in the yaml. I'm providing 3 template config yaml files for users:

A basic user only needs to specify the tuning section:

  1. Just specify accuracy_criterion to tell what kind of model is acceptable, e.g. relative: 0.1, meaning a 0.1 relative accuracy drop is tolerable.
  2. Additionally specify strategy and exit_policy.

An advanced user can also specify model_wise and op_wise to restrict the tuning space and reduce the time needed to find a satisfying quantized model.

version: 1.0                                          # optional. reserved for future use. if not specified, a supported version would be written back to user yaml.

model:                                                # mandatory. used to specify model specific information.
  name: default                                       # mandatory. the model name.
  framework: pytorch_fx                               # mandatory. supported values are tensorflow, pytorch, pytorch_fx, pytorch_ipex, onnxrt_integer, onnxrt_qlinear or mxnet; allow new framework backend extension.
                                                      # Only tested on pytorch_fx for now.
#  inputs: image_tensor                               # optional. inputs and outputs fields are only required in tensorflow.
#  outputs: num_detections,detection_boxes,detection_scores,detection_classes

device: cpu                                           # optional. options:[cpu, gpu]. Default: cpu.

quantization:                                         # optional. tuning constraints on model-wise for advanced users to reduce tuning space.
  approach: post_training_static_quant                # optional. default value is post_training_static_quant.
#  recipes:                                           # optional. used to switch neural_compressor int8 receipts ON or OFF.
#    scale_propagation_max_pooling: True              # optional. default value is True.
#    scale_propagation_concat: True                   # optional. default value is True.
#    first_conv_or_matmul_quantization: True          # optional. default value is True.
  calibration:                                       # optional. specifies calibration behavior for post_training_static_quant; not needed for other quantization approaches.
    sampling_size: 1000, 2000                        # optional. default value is 100. used to set how many samples should be used in calibration.

#  model_wise:                                        # optional. tuning constraints on model-wise for advanced users to reduce tuning space.
#    weight:
#      granularity: per_channel
#      scheme: asym
#      dtype: int8
#      algorithm: minmax
#    activation:
#      granularity: per_tensor
#      scheme: sym
#      dtype: int8
#      algorithm: minmax, kl

#  op_wise: {                                         # optional. tuning constraints on op-wise for advanced users to reduce tuning space.
#         'conv1': {
#           'activation':  {'dtype': ['uint8', 'fp32'], 'algorithm': ['minmax', 'kl'], 'scheme':['sym']},
#           'weight': {'dtype': ['int8', 'fp32'], 'algorithm': ['minmax']}
#         },
#         'pool1': {
#           'activation': {'dtype': ['int8'], 'scheme': ['sym'], 'granularity': ['per_tensor'], 'algorithm': ['minmax', 'kl']},
#         },
#         'default_qconfig': {                       # optional. set default qconfig to fp32 for FX model
#           'activation':  {'dtype': ['fp32']},
#           'weight': {'dtype': ['fp32']}
#         }
#       }

tuning:
  strategy:
    name: bayesian                                      # optional. default value is basic. other values are bayesian, mse, sigopt.
#    sigopt_api_token: YOUR-ACCOUNT-API-TOKEN             # optional. Necessary if strategy name is sigopt.
#    sigopt_project_id: PROJECT-ID                    # optional. Necessary if strategy name is sigopt.
#    sigopt_experiment_name: nc-tune                # optional. default is nc-tune if strategy name is sigopt.
  accuracy_criterion:
    relative:  0.1                                  # optional. default key is relative; the other option is absolute. this example allows a relative accuracy loss of 0.1 (10%).
    higher_is_better: True
  objective: performance                             # optional. objective with accuracy constraint guaranteed. default value is performance. other values are modelsize and footprint.

  exit_policy:
    timeout: 0                                       # optional. tuning timeout (seconds). default value is 0 which means early stop. combine with max_trials field to decide when to exit.
    max_trials: 1000                                  # optional. max tune times. default value is 100. combine with timeout field to decide when to exit.
    performance_only: False                          # optional. default value is False; if True, skip the accuracy-driven tuning and just generate a fully quantized model.
  random_seed: 9527                                  # optional. random seed for deterministic tuning.
  tensorboard: False                                  # optional. dump tensor distribution in evaluation phase for debug purpose. default value is False.

  workspace:
    path: /path/to/saving/directory                  # optional. default workspace is ./nc_workspace/current_time_stamp, saving tuning history and deploy yaml.

For 4: calib_dataloader, val_dataloaders and datamodule
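As a rough illustration (the toy tensors below are placeholders, not from this proposal): calib_dataloader and val_dataloader are ordinary PyTorch dataloaders, and a datamodule would simply be a pytorch_lightning.LightningDataModule bundling them.

import torch
from torch.utils.data import DataLoader, TensorDataset

# toy datasets standing in for real calibration / validation data
calib_ds = TensorDataset(torch.randn(256, 3, 224, 224), torch.randint(0, 10, (256,)))
val_ds = TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 10, (64,)))

calib_dataloader = DataLoader(calib_ds, batch_size=32)  # fed through the model to collect activation ranges (static PTQ)
val_dataloader = DataLoader(val_ds, batch_size=32)      # used during tuning to evaluate accuracy_criterion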

jason-dai commented 2 years ago

1) We plan to support both INC and NNCF.

2) By default we should not ask the user to provide a yaml file; instead, just a minimum set of parameters that need to be specified in a python dictionary. yaml file can be supported for advanced users.

3) Providing three configurations (post_training_dynamic_quant, post_training_static_quant and quantization_aware_training) is very confusing; can we make it simpler?

zhentaocc commented 2 years ago
  1. We plan to support both INC and NNCF.
  2. By default we should not ask the user to provide a yaml file; instead, just a minimum set of parameters that need to be specified in a python dictionary. yaml file can be supported for advanced users.
  3. Providing three configurations (post_training_dynamic_quant, post_training_static_quant and quantization_aware_training) is very confusing; can we make it simpler?
  1. Will look at NNCF.

2&3. INC Quantization API can be:

from typing import Dict

from neural_compressor.experimental import Quantization


class QuantizationINC(Quantization):
    def __init__(self,
                 basic_conf='default_quant.yaml',
                 approach=None,                    # post_training_dynamic_quant, post_training_static_quant or quantization_aware_training
                 strategy=None,                    # basic, bayesian, mse, tpe ...
                 accuracy_criterion: Dict = None,  # e.g. {'relative': 0.1} or {'absolute': 0.98}
                 higher_is_better=None,            # whether a higher metric value means a better model
                 timeout=None,                     # if timeout == 0, either max_trials or accuracy_criterion should be specified
                 max_trials=None,                  # ignored unless timeout == 0
                 ):                                # None means no overriding of the default config
        ...

Usage can be:

quantizer_1 = QuantizationINC()  # default
quantizer_2 = QuantizationINC(approach='post_training_dynamic_quant', # Override default config by a minimum set of keys
                            strategy='bayesian',
                            accuracy_criterion={'relative': 0.1},
                            higher_is_better=True,)  
quantizer_3 = QuantizationINC(basic_conf='customized_conf.yaml')  # Advanced users can specify a more complicated conf by yaml
quantizer_4 = QuantizationINC(basic_conf='customized_conf.yaml', 
                            strategy='bayesian',                # Override config 
                            )   

And then pass model and dataloaders to quantize the model:

qmodel = quantizer(model, calib_dataloader, val_dataloader)

So we can have a default config and let users choose static/dynamic/qat instead of 3 configs. We provide 6 keys that I think will cover most cases for basic users, and advanced users who need much more restrictive control can pass a yaml config following the documentation and instructions.

zhentaocc commented 2 years ago

An introduction to quantization in NNCF / INC / PyTorch / PyTorch Lightning

NNCF

NNCF is a quantization tool provided by OpenVINO, mainly focused on Quantization-Aware Training and supporting PyTorch and TensorFlow. For PyTorch, it uses PyTorch FakeQuantize to turn the model into a trainable fake-quantized model, and they have their own algorithm/implementation to quantize a model.
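A minimal sketch of the usual NNCF QAT flow (assuming NNCF 2.x-style imports; model, calib_dataloader and the input shape are placeholders):

from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args

nncf_config = NNCFConfig.from_dict({
    'input_info': {'sample_size': [1, 3, 224, 224]},   # shape of one sample input
    'compression': {'algorithm': 'quantization'},      # insert FakeQuantize ops into the graph
})
nncf_config = register_default_init_args(nncf_config, calib_dataloader)  # data for range initialization
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)
# ... fine-tune compressed_model as usual (QAT), then export the fake-quantized graph:
compression_ctrl.export_model('model_int8.onnx')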

Pytorch

PyTorch has its own quantization APIs for users, for reference: Quantization for PyTorch. It has two modes for users to choose from, eager mode and fx mode. In brief, in eager mode users must define most of the behavior manually, like Quant/DeQuant placement and which modules to fuse, while in fx mode everything is done automatically and users just have to make sure their module definition is compatible with FX graph tracing.
Another thing is that static and dynamic quantization each have limited operator support in PyTorch, which means users have to carefully choose static or dynamic quantization according to their models. For example, for Seq2Seq, an LSTM-based model, dynamic quantization is a better choice, while for conv-based models like TCN, static quantization should give better performance.
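A minimal sketch of the two torch.quantization paths mentioned above, using the FX workflow of the PyTorch 1.8/1.9 era (model and calib_dataloader are assumed to already exist):

import torch
from torch.quantization import quantize_dynamic, get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

model.eval()

# dynamic quantization: int8 weights, activations quantized on the fly (good for LSTM/Linear-heavy models)
dq_model = quantize_dynamic(model, {torch.nn.Linear, torch.nn.LSTM}, dtype=torch.qint8)

# static quantization in FX mode: needs a calibration pass (good for conv-based models such as TCN)
qconfig_dict = {'': get_default_qconfig('fbgemm')}
prepared = prepare_fx(model, qconfig_dict)      # observers inserted automatically, no manual Quant/DeQuant placement
with torch.no_grad():
    for x, _ in calib_dataloader:               # collect activation statistics
        prepared(x)
sq_model = convert_fx(prepared)                 # convert to a real int8 model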

INC

The INC implementation is completely based on the PyTorch quantization module, so it has the same limitations and similar features (eager vs fx, static vs dynamic) as PyTorch. A great feature I saw in INC is that it provides automatic tuning to find the best quantized model by searching the tuning space defined by the configuration. For example, if a user wants to restrict the accuracy drop to within 0.1, INC will search for a satisfying model, even falling back part of the layers to FP32. But this process can be somewhat time consuming.
If this tuning feature is not needed in Nano, I think we can simply use torch.quantization to implement the feature.
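For reference, a minimal sketch of how that tuning loop is driven through INC's legacy experimental API (conf.yaml is a config like the template above; model and the dataloaders are assumed to exist):

from neural_compressor.experimental import Quantization, common

quantizer = Quantization('conf.yaml')            # yaml defines approach, accuracy_criterion, tuning space, ...
quantizer.model = common.Model(model)            # the FP32 model to quantize
quantizer.calib_dataloader = calib_dataloader    # calibration data for post-training static quant
quantizer.eval_dataloader = val_dataloader       # used by the tuner to check accuracy_criterion each trial
q_model = quantizer()                            # runs calibration + tuning, may fall back ops to FP32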

Pytorch-Lightning

PyTorch Lightning provides a QuantizationAwareTraining callback which uses eager mode in PyTorch, so it is consistent with PyTorch QAT. This callback style is quite useful for us to override any function in a pytorch-lightning module.
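A minimal usage sketch of that callback (MyLightningModule and train_dataloader are placeholders):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import QuantizationAwareTraining

qat_cb = QuantizationAwareTraining(qconfig='fbgemm')   # eager-mode torch.quantization under the hood
trainer = pl.Trainer(max_epochs=3, callbacks=[qat_cb])
trainer.fit(MyLightningModule(), train_dataloader)      # fake-quantized training; the model is converted to int8 when fit ends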

Fundamental Difference

NNCF vs others: NNCF has its own engine to perform quantization, while the others use torch.quantization, which has a lot of limitations as mentioned above.
NNCF & PL vs INC & PyTorch: NNCF & PL are only for QAT. If post-training quantization is needed, we can integrate POT from OpenVINO.
PyTorch vs TensorFlow: I saw the TF examples use Keras. The main difference should be the different usage between TF and PyTorch.

@jason-dai @TheaperDeng FYI. I haven't tried all the modules mentioned. Will update once I have more information and can give a more specific design for INC/NNCF and PyTorch/TF.

zhentaocc commented 2 years ago

Nano must update PyTorch since NNCF does not support 1.8.0.

jason-dai commented 2 years ago

Nano must update PyTorch since NNCF does not support 1.8.0.

Which PyTorch version does NNCF support? Is POT open sourced?

zhentaocc commented 2 years ago

@jason-dai NNCF requires PyTorch >=1.5.0, <=1.9.1 (1.8.0 not supported).
POT: https://docs.openvino.ai/2021.1/pot_README.html
It seems to be open sourced; it's integrated in OpenVINO.

zhentaocc commented 2 years ago

With the NNCF API, what we can do right now is quantization-aware training in PyTorch, then export the model to ONNX and convert the ONNX model to OpenVINO. Finally the model can be compiled by the OpenVINO backend as a quantized model and run at reduced precision. Is there a plan to support OpenVINO? Note that the NNCF-trained model itself is a fake-quantized model: it simulates int8 but still runs on fp32. @jason-dai @TheaperDeng
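A rough sketch of that flow (assuming an NNCF compression_ctrl from QAT and the OpenVINO 2022+ Python runtime; paths and example_input are placeholders):

# 1) export the fake-quantized PyTorch model to ONNX
compression_ctrl.export_model('model_int8.onnx')

# 2) convert the ONNX model to OpenVINO IR, e.g. on the command line:
#      mo --input_model model_int8.onnx --output_dir ./ir

# 3) compile and run with the OpenVINO runtime at reduced precision
from openvino.runtime import Core

compiled = Core().compile_model('./ir/model_int8.xml', 'CPU')
output = compiled([example_input])   # example_input: a numpy array matching the model's input shape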

jason-dai commented 2 years ago

We plan to support OpenVINO

jason-dai commented 2 years ago

Let's focus on adding INC support first; just make sure the design can be easily extended to support NNCF in the future.

zhentaocc commented 2 years ago

Let's focus on adding INC support first; just make sure the design can be easily extended to support NNCF in the future.

NNCF is a QAT-only tool, so I think we can simply extend the design with an NNCF callback:

class NNCF_Quant(Callback):
     ...

Use it by:

trainer = Trainer(callbacks=[NNCF_Quant(...)])
trainer.fit(...)

For INC, I think we can use QuantizationAwareTraining from pytorch-lightning for QAT, usage as below.

trainer = Trainer(callbacks=[QuantizationAwareTraining(...)])
trainer.fit(...)

As for PTQ, we can have an extra class to use like in https://github.com/intel-analytics/BigDL/pull/3602/files. Or another implementation could be:

class Trainer():
    def inc_quant(self, model, calib_dataloader, val_dataloader, framework, **kwargs):
        ...

q_model = trainer.inc_quant(pl_model, calib_dataloader, val_dataloader, 'pytorch_fx')

@jason-dai Seems good? The other option is not using callbacks and directly having QAT and PTQ in one class.

TheaperDeng commented 2 years ago

I think we can focus on the PTQ for now and give an API like this (pytorch):

# bigdl.nano.pytorch.trainer
class Trainer(pl.Trainer):
    ...
    def quantize(self, model, calib_dataloader, val_dataloader, backend="inc", param1=..., param2=..., **kwargs):
        # backend="inc" for possible future backend support; you may use "inc_ipex" etc. to define other sub-types
        # param1, param2 are those config settings; it's OK if we leave a config_file parameter, but
        # we may extract some important parameters and generate a config_file for the users if they don't have one.
        ...
    ...

q_model = trainer.quantize(pl_model, calib_dataloader, val_dataloader)

TheaperDeng commented 2 years ago

btw, is val_dataloader required? Or can we just make it optional and leave only pl_model and calib_dataloader required?

zhentaocc commented 2 years ago

btw, is val_dataloader required? Or can we just make it optional and leave only pl_model and calib_dataloader required?

It's required for tuning.

jason-dai commented 2 years ago
  1. By default, just trainer.quantize(model, calib_dataloader, val_dataloader=None) for PyTorch and model.quantize(calib_dataloader, val_dataloader=None) for Keras

  2. The user may specify additional config through method parameters (using Python dictionary when needed)

  3. Advanced users may optionally provide a config file

zhentaocc commented 2 years ago

How about:

    def quantize(self, model, calib_dataloader, val_dataloader=None, metric: str = None,
                 backend='inc', conf=None, framework='pytorch_fx', approach='ptsq',
                 strategy='bayesian', accuracy_criterion=None, timeout=0, max_trials=1):
        ...

You can refer to https://github.com/intel-analytics/BigDL/pull/3602.
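For example, a hypothetical call against that signature (parameter values are only for illustration):

trainer = Trainer(max_epochs=1)
trainer.fit(pl_model, train_dataloader)

q_model = trainer.quantize(pl_model, calib_dataloader, val_dataloader,
                           metric='accuracy', approach='ptsq', strategy='bayesian',
                           accuracy_criterion={'relative': 0.1},
                           timeout=0, max_trials=10)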

TheaperDeng commented 2 years ago

I think we can have something like this. We may recheck them and rename some of them to be more understandable later. btw, what is framework's valid option?

So, when the users use it, they have such a workflow:

trainer.fit(pl_model, dataloader)
pl_model_quantized = trainer.quantize(pl_model, dataloader, ...)  # PTQ
pred = pl_model_quantized(x)  # quantized inference

zhentaocc commented 2 years ago

I think we can have something like this. We may recheck them and rename some of them to be more understandable later. btw, what is framework's valid option?

tensorflow, pytorch, pytorch_fx, pytorch_ipex, onnxrt_integer, onnxrt_qlinear or mxnet

So, when the users use it, they have such a workflow:

trainer.fit(pl_model, dataloader)
pl_model_quantized = trainer.quantize(pl_model, dataloader, ...)  # PTQ
pred = pl_model_quantized(x)  # quantized inference

Do you mean returning a pytorch module instead of a pl module? I intended to return a pl model so users can do:

trainer.predict(pl_model_quantized, ...)

TheaperDeng commented 2 years ago

I think we can have something like this. We may recheck them and rename some of them to be more understandable later. btw, what is framework's valid option?

tensorflow, pytorch, pytorch_fx, pytorch_ipex, onnxrt_integer, onnxrt_qlinear or mxnet

So, when the users use it, they have such a workflow:

trainer.fit(pl_model, dataloader)
pl_model_quantized = trainer.quantize(pl_model, dataloader, ...)  # PTQ
pred = pl_model_quantized(x)  # quantized inference

Do you mean returning a pytorch module instead of a pl module? I intended to return a pl model so users can do:

trainer.predict(pl_model_quantized, ...)

No, returning a pl model is what I mean.