intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

[Tensorflow] Question: PTQ and QAT #38

Closed: peiwenhuang27 closed this issue 2 years ago

peiwenhuang27 commented 2 years ago

Hi, may I ask a few questions based on my understanding of the source code?

1. Conv2D

As far as I know, in post-training quantization Conv2D supports both the Conv2DBiasAddRelu and Conv2DBiasAddLeakyRelu patterns through FuseNodeStartWithConv2d.apply_conv_biasadd_relu_fusion(...). However, the key difference is that with Leaky ReLU the quantized values cannot be passed directly to the next quantized Conv2D because of the positive-inputs constraint, so QuantizedConv2DWithBiasAndRelu first dequantizes its output to pass it through Leaky ReLU, and the result is then quantized again before the next QuantizedConv2DWithBiasAndRelu.

So, if I have a quantization-aware trained model with the Conv2DBiasAddLeakyRelu pattern, is it converted to a quantized model in the same manner? That is, regardless of the quantization method, in order to pass through Leaky ReLU the predecessor node must first dequantize and the successor node must add a quantize input, is that correct? (A sketch of the pattern I mean is below.)
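For reference, here is a minimal sketch, assuming TF 2.x Keras, of a graph containing the Conv2D + BiasAdd + LeakyRelu pattern followed by a Conv2D + BiasAdd + Relu pattern; the layer sizes and names are illustrative and not taken from this repo:

```python
# Illustrative only: a tiny model whose frozen graph contains
# Conv2D + BiasAdd + LeakyRelu followed by Conv2D + BiasAdd + Relu.
import tensorflow as tf

inputs = tf.keras.Input(shape=(28, 28, 3))
# use_bias=True lowers to Conv2D + BiasAdd in the frozen graph.
x = tf.keras.layers.Conv2D(16, 3, padding="same", use_bias=True)(inputs)
# Leaky ReLU can produce negative activations, which is why it is handled
# differently from plain ReLU during quantization.
x = tf.keras.layers.LeakyReLU(0.1)(x)
x = tf.keras.layers.Conv2D(16, 3, padding="same", use_bias=True)(x)
x = tf.keras.layers.ReLU()(x)
model = tf.keras.Model(inputs, x)
```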

2. LSTM

I noticed the following lines: https://github.com/intel/neural-compressor/blob/1bddfcb5609c8f4643a33b5d7f359138090464ee/neural_compressor/adaptor/tf_utils/quantize_graph/quantize_graph_matmul.py#L121-L125

Does this mean quantization for LSTM is currently not supported?

Thanks!

guomingz commented 2 years ago

For the Conv2D question, I don't think there is a need to insert an additional dequantize/quantize pair before the next QuantizedConv2DWithBiasAndRelu, as this op supports s8 input.
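A hedged sketch of how one could check this by inspecting the converted graph; q_model is assumed here to be the quantized model object, exposing the converted TensorFlow GraphDef via a graph_def attribute:

```python
# Illustrative check: list the quantization-related ops around LeakyRelu in the
# converted graph; q_model is assumed to expose a TensorFlow GraphDef via .graph_def.
for node in q_model.graph_def.node:
    if node.op in ("QuantizeV2", "Dequantize", "LeakyRelu") or node.op.startswith("QuantizedConv2D"):
        print(node.op, node.name)
```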

For LSTM, I remember we have already supported LSTM models since the v1.6 release.
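For anyone who lands here later, a minimal sketch of quantizing a toy LSTM model, assuming the 2.x Python API (PostTrainingQuantConfig, quantization.fit, and neural_compressor.data.DataLoader); the Keras model and the random calibration data are illustrative and not taken from this issue:

```python
# Sketch only: post-training quantization of a small Keras LSTM model.
import numpy as np
import tensorflow as tf
from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.data import DataLoader

# Toy LSTM model: (batch, timesteps, features) -> class probabilities.
inputs = tf.keras.Input(shape=(20, 8))
x = tf.keras.layers.LSTM(32)(inputs)
outputs = tf.keras.layers.Dense(4, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

# Random calibration samples just to drive the PTQ calibration pass.
class CalibDataset:
    def __init__(self, n=32):
        self.samples = [(np.random.rand(20, 8).astype("float32"), 0) for _ in range(n)]
    def __getitem__(self, idx):
        return self.samples[idx]
    def __len__(self):
        return len(self.samples)

calib_loader = DataLoader(framework="tensorflow", dataset=CalibDataset())
q_model = quantization.fit(model, PostTrainingQuantConfig(), calib_dataloader=calib_loader)
```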

ftian1 commented 2 years ago

Closing this if there are no further questions.