intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

Error in fp8 quantization: Invalid scale factor : 1.70e+06, make sure the scale is not larger than : 6.55e+04 #1907

Open yyChen233 opened 2 months ago

yyChen233 commented 2 months ago

When I use this config to quantize a YOLOv3 model to fp8:

```yaml
version: 1.0

model:                                     # mandatory. used to specify model specific information.
  name: yolo_v3
  framework: pytorch                       # mandatory. possible values are tensorflow, mxnet, pytorch, pytorch_ipex, onnxrt_integerops and onnxrt_qlinearops.

quantization:
  approach: post_training_static_quant     # no need for fp8_e5m2
  precision: fp8_e4m3                      # allowed precision is fp8_e5m2, fp8_e4m3, fp8_e3m4
  calibration:
    batchnorm_sampling_size: 3000          # only needed for models w/ BatchNorm
    sampling_size: 104

tuning:
  accuracy_criterion:
    relative: 0.01                         # optional. default value is relative, other value is absolute. this example allows relative accuracy loss: 1%.
  exit_policy:
    max_trials: 50
    timeout: 180                           # optional. tuning timeout (seconds). default value is 0, which means early stop. combine with the max_trials field to decide when to exit.
  random_seed: 1234                        # optional. random seed for deterministic tuning.
```

I got this output:

```
2024-07-09 17:27:54 [INFO] Save tuning history to /mnt/d/LM/neural-compressor/examples/pytorch/object_detection/yolo_v3/quantization/ptq/eager/nc_workspace/2024-07-09_17-27-50/./history.snapshot.
2024-07-09 17:27:54 [INFO] FP32 baseline is: [Accuracy: 0.7232, Duration (seconds): 3.5848]
Error: Invalid scale factor : 1.70e+06, make sure the scale is not larger than : 6.55e+04
```

So how can I handle this problem? Thank you!
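For context on what the check seems to be complaining about: 6.55e+04 is the largest finite float16 value (65504), so the quantizer apparently rejects any per-tensor scale too large to be stored in fp16. Below is a minimal sketch of an amax-based fp8 scale with that guard; it is an illustration, not Neural Compressor's actual code, and the E4M3 maximum of 448 (the OCP FP8 value) plus the `amax / fp8_max` formula are assumptions:

```python
import numpy as np

FP16_MAX = 65504.0   # largest finite float16 value; matches the 6.55e+04 in the error
E4M3_MAX = 448.0     # assumed max finite value of the fp8_e4m3 format (OCP FP8 spec)

def fp8_scale(tensor: np.ndarray) -> float:
    """Per-tensor scale mapping the observed absolute maximum onto the fp8 range.

    Hypothetical helper: raises the same kind of error the quantizer reports
    when the scale cannot be represented in float16.
    """
    amax = float(np.max(np.abs(tensor)))
    scale = amax / E4M3_MAX
    if scale > FP16_MAX:
        raise ValueError(
            f"Invalid scale factor : {scale:.2e}, "
            f"make sure the scale is not larger than : {FP16_MAX:.2e}"
        )
    return scale
```

Under these assumptions, a scale of 1.70e+06 implies a calibration tensor with an absolute maximum around 7.6e+08, i.e. some activation or weight tensor with extreme outliers (or uninitialized/NaN-adjacent values) seen during calibration, which may point at a preprocessing or model-loading problem rather than the config itself.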