google / qkeras

QKeras: a quantization deep learning library for Tensorflow Keras
Apache License 2.0

Kernel weights and activations not quantized after training #96

Closed YogaVicky closed 2 years ago

YogaVicky commented 2 years ago

Hi there! I was interested in implementing the QKeras example for the MNIST CNN model as given in the examples section - Link. This example involves quantizing the weights and activations to INT4 (4 bits) using quantized_bits(4,0,1) for the Conv kernels and activations. I was expecting the weights and activations to be in INT4, but they were in FP32 and there was no integer part to the left of the decimal point. I ran some experiments with the quantized_bits() method on its own and those results were quantized. Here are the weights and activations for the MNIST model after model_save_quantized_weights(): I would essentially want to save the quantized model with INT8 or INT4 weights, convert it into a TRT engine, and do GPU inference. Any pointers? Thanks, Yoga
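
A condensed sketch of the setup being described (layer sizes are illustrative, not copied from the example notebook; the quantizer arguments follow the quantized_bits(4,0,1) call mentioned above):

```python
from tensorflow.keras.layers import Input, Flatten, Activation
from tensorflow.keras.models import Model
from qkeras import QConv2D, QDense, QActivation, quantized_bits
from qkeras.utils import model_save_quantized_weights

x = x_in = Input(shape=(28, 28, 1))
x = QConv2D(16, (3, 3),
            kernel_quantizer=quantized_bits(4, 0, 1),
            bias_quantizer=quantized_bits(4, 0, 1))(x)
x = QActivation("quantized_relu(4, 0)")(x)
x = Flatten()(x)
x = QDense(10,
           kernel_quantizer=quantized_bits(4, 0, 1),
           bias_quantizer=quantized_bits(4, 0, 1))(x)
x = Activation("softmax")(x)
model = Model(x_in, x)

# ... compile and train as usual, then:
model_save_quantized_weights(model, "quantized_weights.h5")
# The saved kernels are still float32 tensors, but each value sits on the
# 4-bit grid implied by quantized_bits(4, 0, 1); the integer view is obtained
# by dividing out the quantizer's scale, as discussed in the replies below.
```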

jurevreca12 commented 2 years ago

Perhaps you are failing to take the scaling factors into account? When you save the weights they are "quasi" quantized: they are actually floats that can only take on a discrete set of values. By using the scaling factors you can get back the "integer" value. Additionally, if you wish to get "integer" numbers without any decimals, you should set the integer parameter to bits-1 for signed numbers and bits for unsigned numbers.

YogaVicky commented 2 years ago

By scaling factor I guess you mean alpha? I tried setting up quantized_bits() with integer=bits-1 and it worked with alpha=1, but when I set it to None, shouldn't it automatically take the value of 1? Moreover, could you explain this statement a bit? "The parameters bits specify the number of bits for the quantization, and integer specifies how many bits of bits are to the left of the decimal point." - mentioned in one of the files. Technically I want to convert FP32 into INT8, so what parameters should I use inside quantized_bits()?

Thanks, Yoga

jurevreca12 commented 2 years ago

When you set alpha equal to a tensor, it will be used as the (constant) scaling factor (there is a separate variable in the code called scale). However, this is not really a good approach, as the performance of your network will likely be poor unless you use a lot of bits. When you set alpha=None, it will automatically be set to "auto_po2" (the other value is "auto"). In that mode the quantizer determines the best power-of-two scaling factor so that the integers best represent your floating-point weights.

With regard to the statement on the number of bits: if you want INT8 parameters, then I think you should set bits=8, integer=7, keep_negative=True. But even then (if you set alpha="auto_po2"), you will see that the weights in your saved model look like decimal numbers, e.g. "0.125". By using the scaling factors you can calculate which integer value they represent: for example, a vector "0.125, 0.25, 0.5..." with a scaling factor of 0.125 is "1, 2, 4...". Maybe look at this paper https://arxiv.org/pdf/1712.05877.pdf for additional information on this.
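
A small sketch of the arithmetic described above, i.e. recovering the integer view by dividing out the scale. The alpha="auto_po2" behaviour and the scale attribute follow my reading of the quantizer code and may vary between QKeras versions:

```python
import numpy as np
from qkeras import quantized_bits

# Sketch: weights come back as floats on a grid (0.125, 0.25, ...); dividing by
# the power-of-two scale the quantizer picked recovers the INT8 values.
q = quantized_bits(bits=8, integer=7, keep_negative=True, alpha="auto_po2")
w = np.array([[0.11, -0.52, 0.31],
              [0.07, 0.49, -0.24]], dtype="float32")
wq = np.array(q(w))        # "quasi-quantized" floats
scale = np.array(q.scale)  # scale chosen per output channel after the call
print(wq / scale)          # the integer values those floats represent
```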

Unfortunately the documentation on quantized_bits is a bit sparse, so I understand your confusion. You really have to look at the code (maybe try stepping through it in a debugger) to understand it.

YogaVicky commented 2 years ago

So any idea what the integer parameter really specifies? It is said to be the number of bits to the left of the decimal point. I get the scaling-factor point, but aren't the bits and integer parameters misleading?

jurevreca12 commented 2 years ago

It is a little misleading in some cases, yes. But maybe you are looking at it from the perspective of whole numbers. If you consider the quantizer as a fixed-point quantizer (see http://www.digitalsignallabs.com/fp.pdf), then the "integer" parameter does signify the number of bits to the left of the decimal point. So unsigned whole numbers are a special case of the quantizer, when integer=bits. Signed whole numbers (two's complement) are another special case, when integer=bits-1 (one bit goes to the sign).
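
A quick way to see this fixed-point reading in action (this is my interpretation of quantized_bits with its default alpha, so exact rounding and clipping details may differ across versions):

```python
import numpy as np
from qkeras import quantized_bits

# bits=4, integer=0, keep_negative=True: one sign bit, zero integer bits,
# three fraction bits, so the grid is multiples of 0.125 in roughly [-1, 0.875].
q4 = quantized_bits(bits=4, integer=0, keep_negative=True)
x = np.linspace(-1.5, 1.5, 25).astype("float32")
print(np.unique(np.array(q4(x))))

# bits=8, integer=7, keep_negative=True: the same construction gives the whole
# numbers -128..127, i.e. plain INT8 values.
q8 = quantized_bits(bits=8, integer=7, keep_negative=True)
print(np.unique(np.array(q8(np.linspace(-150.0, 150.0, 31).astype("float32")))))
```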

YogaVicky commented 2 years ago

So I am not going to get a 7-bit number (8th bit for the sign) as a weight when I use quantized_bits(bits=8, integer=7), right? Instead I'll be getting a scaled-down version of all the weights?

jurevreca12 commented 2 years ago

With quantized_bits(bits=8, integer=7, keep_negative=True) you will get 2^bits different, uniformly spaced values. These weights can be represented with an 8-bit signed (two's complement) number in hardware. Regarding the scaling: the scaling factors allow you to approximately recover the result you would get if you did the calculations in float. So if you have quantized weights WQ in INT8 and a quantized input XQ in INT8, then ((WQ*XQ) + bias)*scale is approximately the same as (W*X + bias), where W and X are the original float weights and inputs.
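
A toy NumPy illustration (plain NumPy, not the QKeras API) of that relation, using a single power-of-two scale per tensor; the po2_scale helper is hypothetical, just to make the arithmetic concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4)).astype(np.float32)
X = rng.normal(scale=0.5, size=(4,)).astype(np.float32)
bias = rng.normal(scale=0.1, size=(4,)).astype(np.float32)

def po2_scale(t, bits=8):
    # smallest power of two such that t / scale fits in a signed `bits`-bit integer
    return 2.0 ** np.ceil(np.log2(np.abs(t).max() / (2 ** (bits - 1) - 1)))

sw, sx = po2_scale(W), po2_scale(X)
WQ = np.round(W / sw)                # INT8-representable weight integers
XQ = np.round(X / sx)                # INT8-representable input integers
bias_q = np.round(bias / (sw * sx))  # bias quantized at the accumulator scale

approx = ((WQ @ XQ) + bias_q) * (sw * sx)  # integer arithmetic, rescaled once at the end
exact = W @ X + bias
print(np.abs(approx - exact).max())        # small quantization error
```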

YogaVicky commented 2 years ago

Thanks for the replies @jurevreca12! So we now have quantized weights that have to be scaled up to integer weights for INT8 or INT4 GPU inference. In NVIDIA's QAT we similarly have FP32 weights which, when passed through TensorRT, get scaled to INT8. I tried the same with QKeras and TRT wasn't able to recognise the quantized kernels. Any idea how to go about GPU inference with QKeras?

Thanks, Yoga

jurevreca12 commented 2 years ago

I am not an author of QKeras, but I would imagine getting these networks to run on a GPU is non-trivial. Quantization-aware training is still more of a research field, so the infrastructure is quite lacking. One way to get such a model to run on your GPU is to implement the kernels in CUDA. Alternatively, you would need to convert from the QKeras save format to some TensorRT format (assuming it supports all the desired operations). Both of these options are quite involved. But maybe there is an easier way that I just don't know about.

YogaVicky commented 2 years ago

By the way @jurevreca12, is there any way to see what scale is being used for, let's say, some kernel quantizer or bias quantizer?

jurevreca12 commented 2 years ago

@YogaVicky Yes. You can see the scale being used in the "scale" variable of the quantizer. But depending on the configuration, the scale may be recomputed based on the inputs.
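
For example, something along these lines should show it; attribute names such as kernel_quantizer_internal follow my reading of the QKeras layer code and may differ between versions:

```python
import numpy as np
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from qkeras import QConv2D, quantized_bits

x_in = Input(shape=(28, 28, 1))
y = QConv2D(8, (3, 3),
            kernel_quantizer=quantized_bits(4, 0, alpha="auto_po2"),
            bias_quantizer=quantized_bits(4, 0))(x_in)
model = Model(x_in, y)

layer = model.layers[1]
kq = layer.kernel_quantizer_internal  # the instantiated quantizer object
_ = kq(layer.kernel)                  # in the "auto" modes, calling it (re)computes the scale
print(np.array(kq.scale))             # power-of-two scale per output channel
```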