Closed zhentaocc closed 2 years ago
yes we can do that, and we must document all the keys for users.
As shown below, this is a template config for running INC on Nano. I removed a great portion of the INC template, since those keys are required by Nano and so don't need to be assigned in the yaml. I'm providing 3 template config yaml files for users:
For a basic user, he or she can specify the tuning section:
- accuracy_criterion, to tell what kind of model is needed, like relative: 0.1, meaning a 0.1 accuracy drop is tolerable.
- strategy, exit_policy
For an advanced user, he or she can specify model_wise and op_wise to get a more restricted tuning space and reduce the time needed to find a satisfying quantized model.
version: 1.0  # optional. reserved for future use. if not specified, a supported version would be written back to user yaml.

model:  # mandatory. used to specify model specific information.
  name: default  # mandatory. the model name.
  framework: pytorch_fx  # mandatory. supported values are tensorflow, pytorch, pytorch_fx, pytorch_ipex, onnxrt_integer, onnxrt_qlinear or mxnet; allow new framework backend extension.
                         # Only tested on pytorch_fx for now.
  # inputs: image_tensor  # optional. inputs and outputs fields are only required in tensorflow.
  # outputs: num_detections,detection_boxes,detection_scores,detection_classes

device: cpu  # optional. options: [cpu, gpu]. Default: cpu.

quantization:  # optional. tuning constraints on model-wise for advanced users to reduce tuning space.
  approach: post_training_static_quant  # optional. default value is post_training_static_quant.
  # recipes:  # optional. used to switch neural_compressor int8 recipes ON or OFF.
  #   scale_propagation_max_pooling: True  # optional. default value is True.
  #   scale_propagation_concat: True  # optional. default value is True.
  #   first_conv_or_matmul_quantization: True  # optional. default value is True.
  calibration:  # optional. used to specify calibration behavior of post-training-static-quant. not needed for other quantization approaches.
    sampling_size: 1000, 2000  # optional. default value is 100. used to set how many samples should be used in calibration.
  # model_wise:  # optional. tuning constraints on model-wise for advanced users to reduce tuning space.
  #   weight:
  #     granularity: per_channel
  #     scheme: asym
  #     dtype: int8
  #     algorithm: minmax
  #   activation:
  #     granularity: per_tensor
  #     scheme: sym
  #     dtype: int8
  #     algorithm: minmax, kl
  # op_wise: {  # optional. tuning constraints on op-wise for advanced users to reduce tuning space.
  #   'conv1': {
  #     'activation': {'dtype': ['uint8', 'fp32'], 'algorithm': ['minmax', 'kl'], 'scheme': ['sym']},
  #     'weight': {'dtype': ['int8', 'fp32'], 'algorithm': ['minmax']}
  #   },
  #   'pool1': {
  #     'activation': {'dtype': ['int8'], 'scheme': ['sym'], 'granularity': ['per_tensor'], 'algorithm': ['minmax', 'kl']},
  #   },
  #   'default_qconfig': {  # optional. set default qconfig to fp32 for FX model
  #     'activation': {'dtype': ['fp32']},
  #     'weight': {'dtype': ['fp32']}
  #   }
  # }

tuning:
  strategy:
    name: bayesian  # optional. default value is basic. other values are bayesian, mse, sigopt.
    # sigopt_api_token: YOUR-ACCOUNT-API-TOKEN  # optional. Necessary if strategy name is sigopt.
    # sigopt_project_id: PROJECT-ID  # optional. Necessary if strategy name is sigopt.
    # sigopt_experiment_name: nc-tune  # optional. default is nc-tune if strategy name is sigopt.
  accuracy_criterion:
    relative: 0.1  # optional. default value is relative, other value is absolute. this example allows relative accuracy loss: 10%.
    higher_is_better: True
  objective: performance  # optional. objective with accuracy constraint guaranteed. default value is performance. other values are modelsize and footprint.
  exit_policy:
    timeout: 0  # optional. tuning timeout (seconds). default value is 0 which means early stop. combine with max_trials field to decide when to exit.
    max_trials: 1000  # optional. max tune times. default value is 100. combine with timeout field to decide when to exit.
    performance_only: False  # optional. default value is False which means only generate fully quantized model.
  random_seed: 9527  # optional. random seed for deterministic tuning.
  tensorboard: False  # optional. dump tensor distribution in evaluation phase for debug purpose. default value is False.
  workspace:
    path: /path/to/saving/directory  # optional. default workspace is ./nc_workspace/current_time_stamp, saving tuning history and deploy yaml.
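To make the accuracy_criterion key concrete, here is a minimal sketch of how a relative or absolute tolerance could be checked against a baseline metric. The function name `meets_accuracy_criterion` and the exact formulas are assumptions for illustration; they are not taken from INC's source.

```python
def meets_accuracy_criterion(baseline, quantized, criterion, higher_is_better=True):
    """Return True if the quantized metric stays within the tolerated drop.

    `criterion` is a one-entry dict such as {"relative": 0.1} or {"absolute": 0.01}.
    This is an illustrative reading of the config key, not INC's implementation.
    """
    if len(criterion) != 1:
        raise ValueError("expected exactly one of 'relative' or 'absolute'")
    kind, tolerance = next(iter(criterion.items()))

    # Express the change as a "drop": a positive value means the quantized model is worse.
    drop = baseline - quantized if higher_is_better else quantized - baseline

    if kind == "relative":
        return drop <= tolerance * abs(baseline)
    if kind == "absolute":
        return drop <= tolerance
    raise ValueError(f"unknown criterion: {kind!r}")


# With relative: 0.1, a baseline accuracy of 0.80 tolerates a drop of up to 0.08.
print(meets_accuracy_criterion(0.80, 0.75, {"relative": 0.1}))
print(meets_accuracy_criterion(0.80, 0.70, {"relative": 0.1}))
```

Under this reading, relative: 0.1 scales the tolerance with the baseline, while absolute compares the raw difference between the two metrics.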
calib_dataloader, val_dataloaders and datamodule
quantization_aware_training: the same as trainer.fit(train_dataloaders, val_dataloaders, datamodule). You don't need to specify all 3. calib_dataloader is used for calibration and val_dataloaders is used for evaluation, to see how well the quantized model works. If the calib/val loaders are not specified, then datamodule should be specified, with datamodule.train_loaders() (used as calib_dataloader) and datamodule.val_loaders() not None. In INC, evaluation is used to search for an optimal quantized model within the tuning space the user specifies.
Does this mean we need to support PyTorch and TensorFlow with both INC and NNCF? Or with just one of INC and NNCF?
1) We plan to support both INC and NNCF.
2) By default we should not ask the user to provide a yaml file; instead, just a minimum set of parameters that need to be specified in a python dictionary. yaml file can be supported for advanced users.
3) Providing three configurations (post_training_dynamic_quant, post_training_static_quant and quantization_aware_training) is very confusing; can we make it simpler?
2&3. INC Quantization API can be:
from typing import Dict, Optional

from neural_compressor.experimental import Quantization


class QuantizationINC(Quantization):
    def __init__(self,
                 basic_conf='default_quant.yaml',
                 approach=None,  # post_training_dynamic_quant, post_training_static_quant and quantization_aware_training
                 strategy=None,  # basic, bayesian, mse, tpe ...
                 accuracy_criterion: Optional[Dict] = None,  # {'relative': 0.1} , {'absolute': 0.98}
                 higher_is_better=None,  # Does a higher metric value mean better?
                 timeout=None,  # if timeout==0, either max_trials or accuracy_criterion should be specified
                 max_trials=None,  # if not timeout==0, this key is not working
                 ):  # None means no overriding
        ...
Usage can be:
quantizer_1 = QuantizationINC()  # default
quantizer_2 = QuantizationINC(approach='post_training_dynamic_quant',  # Override default config by a minimum set of keys
                              strategy='bayesian',
                              accuracy_criterion={'relative': 0.1},
                              higher_is_better=True)
quantizer_3 = QuantizationINC(basic_conf='customized_conf.yaml')  # Advanced users can specify a more complicated conf by yaml
quantizer_4 = QuantizationINC(basic_conf='customized_conf.yaml',
                              strategy='bayesian',  # Override config
                              )
And then pass model and dataloaders to quantize the model:
qmodel = quantizer(model, calib_dataloader, val_dataloader)
So we can have one default config and let users choose static/dynamic/qat, instead of maintaining 3 configs. We provide 6 keys, which I think will cover most cases for basic users; advanced users who need more restrictions can pass a yaml config, following the documentation and instructions.
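As a rough sketch of the "None means no overriding" idea, the six keys could be mapped onto their nested positions in the yaml template and applied on top of a base config. The `KEY_PATHS` table and the `apply_overrides` helper are hypothetical names used only for illustration, and the base config is a plain dict rather than a loaded yaml file.

```python
import copy

# Hypothetical mapping from the flat keyword names to their nested
# location in the yaml template shown earlier in this thread.
KEY_PATHS = {
    "approach": ("quantization", "approach"),
    "strategy": ("tuning", "strategy", "name"),
    "accuracy_criterion": ("tuning", "accuracy_criterion"),
    "higher_is_better": ("tuning", "accuracy_criterion", "higher_is_better"),
    "timeout": ("tuning", "exit_policy", "timeout"),
    "max_trials": ("tuning", "exit_policy", "max_trials"),
}


def apply_overrides(base_conf, **overrides):
    """Return a copy of base_conf with every non-None override applied."""
    unknown = set(overrides) - set(KEY_PATHS)
    if unknown:
        raise TypeError(f"unsupported keys: {sorted(unknown)}")

    conf = copy.deepcopy(base_conf)  # never mutate the caller's config
    # Walk KEY_PATHS in its own order so accuracy_criterion is applied
    # before higher_is_better, which lives inside it.
    for key, path in KEY_PATHS.items():
        value = overrides.get(key)
        if value is None:  # None means "keep whatever the base config says"
            continue
        *parents, leaf = path
        node = conf
        for name in parents:
            node = node.setdefault(name, {})
        if key == "accuracy_criterion":
            # Replace relative/absolute as a unit, but keep an existing flag.
            old = node.get(leaf, {})
            value = dict(value)
            if "higher_is_better" in old:
                value.setdefault("higher_is_better", old["higher_is_better"])
        node[leaf] = value
    return conf


base = {
    "quantization": {"approach": "post_training_static_quant"},
    "tuning": {
        "strategy": {"name": "basic"},
        "accuracy_criterion": {"relative": 0.1, "higher_is_better": True},
    },
}
print(apply_overrides(base, approach="post_training_dynamic_quant", strategy="bayesian"))
```

Replacing accuracy_criterion as a unit avoids ending up with both relative and absolute set at once, which a naive dictionary merge would allow.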
NNCF is a quantization tool provided by OpenVINO. It mainly focuses on Quantization-Aware Training and supports PyTorch and TensorFlow. For PyTorch, it uses PyTorch's FakeQuantize to turn the model into a trainable quantized model, and it has its own algorithm/implementation for quantizing a model.
PyTorch has its own quantization APIs for users; for reference: Quantization for Pytorch. It offers two modes, eager mode and FX mode. In brief, in eager mode users must define most of the behavior manually, such as Quant/DeQuant placement and which modules to fuse, while in FX mode everything is done automatically and users just have to make sure their module definition complies with FX graph mode.
Another thing is that static quantization and dynamic quantization have limited support in PyTorch, which means users have to choose carefully between static and dynamic quantization according to their models. For example, for seq2seq, an LSTM-based model, dynamic quantization is a better choice, while for conv-based models like tcn, static quantization should give better performance.
INC's implementation is based entirely on the PyTorch quantization module, so it has the same limitations and similar features (eager vs fx, static vs dynamic) as PyTorch. A great feature I saw in INC is automatic tuning: it finds the best quantized model by searching the tuning space defined by the configuration. For example, if a user wants to restrict the accuracy drop to within 0.1, INC will search for a satisfying model, even falling back some of the layers to FP32 if needed. But this process can be a bit time consuming.
If this tuning feature is not needed in Nano, I think we can simply use torch.quantization to implement the feature.
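To picture what "searching for a satisfying model by falling back layers to FP32" means, here is a toy greedy loop. It is only an illustration under made-up numbers: the layer names, accuracy penalties and the greedy rule are all assumptions, and INC's real strategies (basic, bayesian, mse, ...) work differently.

```python
def tune_with_fallback(layers, evaluate, baseline, max_relative_drop):
    """Start fully quantized; move layers back to FP32 until the constraint holds.

    `evaluate` takes the set of layers kept in FP32 and returns an accuracy.
    """
    fp32_layers = []
    remaining = list(layers)
    while True:
        accuracy = evaluate(frozenset(fp32_layers))
        if baseline - accuracy <= max_relative_drop * baseline:
            return fp32_layers, accuracy
        if not remaining:
            raise RuntimeError("no configuration satisfies the accuracy constraint")
        # Greedy choice: fall back the layer whose FP32 version helps accuracy most.
        best = max(remaining, key=lambda name: evaluate(frozenset(fp32_layers + [name])))
        fp32_layers.append(best)
        remaining.remove(best)


# Made-up accuracy cost of quantizing each layer, for demonstration only.
PENALTY = {"conv1": 0.06, "conv2": 0.01, "fc": 0.02}
BASELINE = 0.90


def toy_evaluate(fp32_layers):
    # Every layer that stays quantized subtracts its penalty from the baseline.
    return BASELINE - sum(p for name, p in PENALTY.items() if name not in fp32_layers)


kept_fp32, accuracy = tune_with_fallback(PENALTY, toy_evaluate, BASELINE, 0.05)
print(kept_fp32, round(accuracy, 2))
```

Each candidate costs at least one evaluation pass, which is where the tuning time goes.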
PyTorch Lightning provides a QuantizationAwareTraining callback, which uses eager mode in PyTorch and is consistent with PyTorch QAT. This callback style is quite useful for us, since it lets us override any function in a pytorch-lightning module.
NNCF vs others: NNCF has its own engine to perform quantization, while the others use torch.quantization, which has lots of limitations as mentioned above.
NNCF&PL vs INC&Pytorch: NNCF&PL are only for QAT. If post-training quantization is needed, we can integrate POT from OpenVINO.
Pytorch vs Tensorflow: The tf examples I saw use keras. The main difference should be the different usage between TF and Pytorch.
@jason-dai @TheaperDeng FYI. I haven't tried all the modules mentioned. I will update once I have more information, and give a more specific design for INC/NNCF and Pytorch/TF.
Nano must update PyTorch, since NNCF does not support 1.8.0.
Which PyTorch version does NNCF support? Is POT open sourced?
@jason-dai NNCF requires PyTorch* >=1.5.0, <=1.9.1 (1.8.0 not supported).
POT: https://docs.openvino.ai/2021.1/pot_README.html
Seems to be open sourced. It's integrated in openvino.
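The version constraint quoted above (PyTorch >=1.5.0, <=1.9.1, with 1.8.0 excluded) can be written as a small check. The function name `nncf_supports` is hypothetical, and the parsing assumes a plain numeric version string with an optional local suffix such as "+cpu".

```python
def nncf_supports(torch_version: str) -> bool:
    """Check a PyTorch version string against the range quoted in this thread."""
    # Drop a local build suffix such as "+cpu", then keep at most three components.
    parts = torch_version.split("+")[0].split(".")[:3]
    numbers = [int(p) for p in parts]
    # Pad short versions such as "1.9" to three components.
    version = tuple(numbers + [0] * (3 - len(numbers)))

    if version == (1, 8, 0):  # explicitly listed as not supported
        return False
    return (1, 5, 0) <= version <= (1, 9, 1)


for candidate in ("1.8.0", "1.9.0", "1.10.0"):
    print(candidate, nncf_supports(candidate))
```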
With the NNCF API, what we can do right now is quantization-aware training in PyTorch, then export the model to ONNX and convert the ONNX model to OpenVINO. Finally, the model can be compiled by the OpenVINO backend as a quantized model and run at reduced precision. Is there a plan to support OpenVINO? Shown below is a fake-quantized model: it simulates int8 but still runs in fp32. @jason-dai @TheaperDeng
We plan to support OpenVINO
Let's focus on adding INC support first; just make sure the design can be easily extended to support NNCF in the future.
NNCF is a QAT-only tool, so I think we can simply support it through an NNCF callback:
class NNCF_Quant(Callback):
    ...
Use it by:
trainer = Trainer(callbacks=[NNCF_Quant(...)])
trainer.fit(...)
For INC, I think we can use QuantizationAwareTraining from pytorch-lightning for QAT, with usage as below.
trainer = Trainer(callbacks=[QuantizationAwareTraining(...)])
trainer.fit(...)
As for PTQ, we can have an extra class to use like in https://github.com/intel-analytics/BigDL/pull/3602/files. Or another implementation could be:
class Trainer():
    def inc_quant(self, model, calib_dataloader, val_dataloader, framework, ...):
        ...

q_model = trainer.inc_quant(pl_model, calib_dataloader, val_dataloader, 'pytorch_fx')
@jason-dai Seems good? The other option is not using callbacks and directly have QAT and PTQ in one class.
I think we can focus on the PTQ for now and give an API like this (pytorch):
# bigdl.nano.pytorch.trainer
class Trainer(pl.Trainer):
    ...
    def quantize(self, model, calib_dataloader, val_dataloader, backend="inc", param1=..., param2=..., ...):
        # backend="inc" for possible future backend support, you may use "inc_ipex" etc. to define other sub type
        # param1,2 are those config settings, it's OK if we leave a config_file parameter but
        # we may extract some important parameters and generate a config_file for the users if they don't have one.
        ...
    ...

q_model = trainer.quantize(pl_model, calib_dataloader, val_dataloader)
btw, is val_dataloader required? Or can we make it optional and leave only pl_model and calib_dataloader as required?
It's required for tuning.
By default, just trainer.quantize(model, calib_dataloader, val_dataloader=None) for PyTorch and model.quantize(calib_dataloader, val_dataloader=None) for Keras
The user may specify additional config through method parameters (using Python dictionary when needed)
Advanced users may optionally provide a config file
How about:
def quantize(self, model, calib_dataloader, val_dataloader=None, metric: str = None,
             backend='inc', conf=None, framework='pytorch_fx', approach='ptsq',
             strategy='bayesian', accuracy_criterion=None, timeout=0, max_trials=1)
You can refer to https://github.com/intel-analytics/BigDL/pull/3602.
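A minimal sketch of how such a quantize entry point might dispatch on backend and enforce that val_dataloader is present when tuning is requested. Everything here is a stand-in: `_quantize_with_inc` is a placeholder that does not call INC, `BACKENDS` is a hypothetical registry, and the function is written at module level rather than as a Trainer method to keep the example self-contained.

```python
def _quantize_with_inc(model, calib_dataloader, val_dataloader, **conf):
    # Placeholder: a real implementation would hand these over to INC.
    return {"backend": "inc", "model": model, "conf": conf}


# Hypothetical registry, so that other backends can be added later.
BACKENDS = {"inc": _quantize_with_inc}


def quantize(model, calib_dataloader, val_dataloader=None, backend="inc",
             accuracy_criterion=None, **conf):
    """Dispatch post-training quantization to the selected backend."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {sorted(BACKENDS)}")
    # Tuning needs an evaluation set to measure accuracy against the criterion.
    if accuracy_criterion is not None and val_dataloader is None:
        raise ValueError("val_dataloader is required when accuracy_criterion is set")
    return BACKENDS[backend](model, calib_dataloader, val_dataloader,
                             accuracy_criterion=accuracy_criterion, **conf)


result = quantize("pl_model", ["calib-batch"])
print(result["backend"])
```

Keeping val_dataloader optional matches the default call discussed above, while still failing early when a criterion is given without data to evaluate it.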
I think we can have something like this. We may recheck the parameters and rename some of them to be more understandable later.
btw, what are the valid options for framework?
So, when the users use it, they have such a workflow:
Train.fit(pl_model, dataloader)
pl_model_quantized = Train.quantize(pl_model, dataloader, ...) # PTQ
pred = pl_model_quantized(x) # quantized inference
The valid options for framework are: tensorflow, pytorch, pytorch_fx, pytorch_ipex, onnxrt_integer, onnxrt_qlinear or mxnet.
Do you mean returning a pytorch module instead of a pl module? I intended to return a pl model so users can do:
trainer.predict(pl_model_quantized, ...)
No, returning a pl model is what I mean.
Description
Propose to integrate quantization methods into nano to reduce the model size and accelerate inference. Neural Compressor provides a set of methods to quantize a model to simplify the usage.
Discussion on the details is as below in comments.
Related tasks