alessandroaimar opened this issue 4 years ago
@nutsiepully
I second that: quantization aware training is actually tied to the default 8-bit quantization scheme because of that.
Even changing the default QuantizeConfig to make the TFMOT default quantizers use 4 bits seems difficult without hacks: I was able to generate 4-bit QAT integer models that include BatchNorm, but only through a nasty workaround I found after reading the code, namely manually setting the Conv2D activation to 'NoOpActivation' so that the hard-wired transforms behave as expected.
Here is the corresponding gist: https://gist.github.com/corvoysier/c49c09ce1aad7278c84b1e237fdc27b3
It contains three files:
bug_nbit_qat_transforms.py
n_bit_qat_helpers.py
tflite_helpers.py
Without workaround: python bug_nbit_qat_transforms.py --save
With workaround: python bug_nbit_qat_transforms.py -w --save
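For anyone following along, here is a minimal sketch of what a custom 4-bit QuantizeConfig looks like with the public API (the layer choice and quantizer settings are illustrative, not the exact configuration from the gist):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

LastValueQuantizer = tfmot.quantization.keras.quantizers.LastValueQuantizer
MovingAverageQuantizer = tfmot.quantization.keras.quantizers.MovingAverageQuantizer


class Conv2D4BitQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
    """Sketch: quantize Conv2D kernels and activations with 4 bits (emulation only)."""

    def get_weights_and_quantizers(self, layer):
        return [(layer.kernel, LastValueQuantizer(
            num_bits=4, symmetric=True, narrow_range=False, per_axis=False))]

    def get_activations_and_quantizers(self, layer):
        return [(layer.activation, MovingAverageQuantizer(
            num_bits=4, symmetric=False, narrow_range=False, per_axis=False))]

    def set_quantize_weights(self, layer, quantize_weights):
        layer.kernel = quantize_weights[0]

    def set_quantize_activations(self, layer, quantize_activations):
        layer.activation = quantize_activations[0]

    def get_output_quantizers(self, layer):
        return []

    def get_config(self):
        return {}


# Annotate a Conv2D layer with the custom config and build the QAT model.
model = tf.keras.Sequential([
    tfmot.quantization.keras.quantize_annotate_layer(
        tf.keras.layers.Conv2D(8, 3, input_shape=(32, 32, 3)),
        quantize_config=Conv2D4BitQuantizeConfig()),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])

with tfmot.quantization.keras.quantize_scope(
        {'Conv2D4BitQuantizeConfig': Conv2D4BitQuantizeConfig}):
    qat_model = tfmot.quantization.keras.quantize_apply(
        tfmot.quantization.keras.quantize_annotate_model(model))
```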
@alessandroaimar - Thanks for your feedback. Your concern is valid that the default_8bit transforms get applied to all models.
However, don't worry: we have a planned modification for this. The initial version of the library hasn't fully exposed the APIs that will be used for experimentation. The goal of the initial version was to provide a default path that works, and to have some experimental support for custom modifications.
We'll be providing an API modification which allows users to override both the QuantizeLayoutTransform and the QuantizeRegistry for their application code. So this will completely remove those issues. The default transforms will only get applied for the default cases :)
@corvoysier - As mentioned, the upcoming update to the API will take care of the transforms. When using custom quantization configuration, you can disable them.
> Even changing the default QuantizeConfig to make the TFMOT default quantizers use 4 bits seems difficult without hacks: I was able to generate 4-bit QAT integer models that include BatchNorm, but only through a nasty workaround I found after reading the code, namely manually setting the Conv2D activation to 'NoOpActivation' so that the hard-wired transforms behave as expected.
I went through your code. I think there is some confusion as to the support offered by the library.
Please note that when using custom quantization such as 4-bits, the support only extends to the Keras API, not the converter or execution in TFLite. TFLite only has 8-bit kernels at the moment, and only supports the 8-bit quantization scheme. If you want support for 4-bit, it's your responsibility to write those kernels, and ensure conversion.
The library itself allows you to emulate and predict quantized model accuracy with QAT for various bit configurations. However, conversion and execution on those target environments need to be supported by you. The default support only extends to the 8-bit spec.
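(To make the "conversion is on you" part concrete: one way to hand the learned ranges to a custom low-bit backend is to read them back out of the trained QAT model. A rough sketch; the assumption that the quantize wrappers expose their range variables with names ending in '_min'/'_max' is an internal detail and should be verified against the TFMOT version in use.)

```python
# Rough sketch: harvest learned quantization ranges from a trained QAT model
# so they can be exported to a custom (non-TFLite) low-bit backend.
# ASSUMPTION: the quantize wrappers store their range variables with names
# ending in '_min'/'_max'; verify this against your TFMOT version.
def collect_quantization_ranges(qat_model):
    ranges = {}
    for layer in qat_model.layers:
        for var in layer.weights:
            if var.name.endswith('_min:0') or var.name.endswith('_max:0'):
                ranges.setdefault(layer.name, {})[var.name] = var.numpy()
    return ranges
```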
In this case, the code does exactly as expected. If you look here, the code explicitly does not apply the NoOpActivation change because there is a custom quantize config. In the presence of a custom config, the library cannot know about the target environment, and it has to be supported by the user.
In your code here, you didn't need to use NoOpActivation. That's an internal implementation detail. Simply changing your Conv2DQuantizeConfig to not use an ActivationQuantizer would have done the trick.
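(For reference, a config along these lines, a sketch rather than the exact class from the gist, keeps a 4-bit weight quantizer but returns no activation quantizers, so nothing has to be forced to NoOpActivation:)

```python
import tensorflow_model_optimization as tfmot

LastValueQuantizer = tfmot.quantization.keras.quantizers.LastValueQuantizer


class Conv2DWeightOnlyQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
    """Sketch: 4-bit weight quantization only, no activation quantizer."""

    def get_weights_and_quantizers(self, layer):
        return [(layer.kernel, LastValueQuantizer(
            num_bits=4, symmetric=True, narrow_range=False, per_axis=False))]

    def get_activations_and_quantizers(self, layer):
        # Returning an empty list leaves the layer's activation untouched,
        # so there is no need to swap in NoOpActivation manually.
        return []

    def set_quantize_weights(self, layer, quantize_weights):
        layer.kernel = quantize_weights[0]

    def set_quantize_activations(self, layer, quantize_activations):
        # Nothing to set: no activation quantizers were requested above.
        pass

    def get_output_quantizers(self, layer):
        return []

    def get_config(self):
        return {}
```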
I'm sorry you had a bad experience with this, but please let me know if there is anything else we can do to make it clear what the support of the library is, or how to make the custom quantize config experience easier.
@nutsiepully thank you very much for the detailed answer.
> I went through your code. I think there is some confusion as to the support offered by the library.
Yes, although the number of bits is configurable in various places, I understand now that the target quantization type is 8-bit.
> If you want support for 4-bit, it's your responsibility to write those kernels, and ensure conversion.
That was also my conclusion, although I don't know how I would make these kernels known to the MLIR converter: do you have pointers to some examples?
> In your code here, you didn't need to use NoOpActivation. That's an internal implementation detail. Simply changing your Conv2DQuantizeConfig to not use an ActivationQuantizer would have done the trick.
I actually wanted the BatchNorm to be folded; that's why I did this hack. If I had been able to do my own transforms, I would have done it differently: I assume I must do my transforms before calling quantize_apply.
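(As an aside, if the goal is just a folded convolution without relying on the built-in transforms, the BatchNorm statistics can be folded into the Conv2D weights by hand before annotating the model. A minimal sketch of the standard folding math, not TFMOT's transform API; note this is a static fold, so the BN statistics are frozen rather than re-folded during training:)

```python
import numpy as np


def fold_batchnorm_into_conv(kernel, bias, gamma, beta, mean, variance, eps=1e-3):
    """Standard BatchNorm folding.

    kernel has shape (kh, kw, in_ch, out_ch); bias, gamma, beta, mean and
    variance are per-output-channel vectors of length out_ch.
    Returns (folded_kernel, folded_bias) such that
    conv(x, folded_kernel) + folded_bias == BN(conv(x, kernel) + bias).
    """
    scale = gamma / np.sqrt(variance + eps)   # per-output-channel scale
    folded_kernel = kernel * scale            # broadcasts over the last axis
    folded_bias = beta + (bias - mean) * scale
    return folded_kernel, folded_bias
```

The folded kernel and bias can then be set on a plain Conv2D (with the BatchNormalization layer removed) before annotating it with the 4-bit config.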
> I'm sorry you had a bad experience with this, but please let me know if there is anything else we can do to make it clear what the support of the library is, or how to make the custom quantize config experience easier.
Well I think you should make it clear that only 8-bit quantization is supported. This should be the first thing you read when going through the "Modifying quantization parameters" section of the comprehensive guide.
> The library itself allows you to emulate and predict quantized model accuracy with QAT for various bit configurations. However, conversion and execution on those target environments need to be supported by you. The default support only extends to the 8-bit spec.
The example is also misleading: it explicitly mentions a 4-bit quantization of the weights, but never mentions that in the resulting converted model those weights will still be stored as 8-bit values, merely constrained to 4-bit levels.
> Well I think you should make it clear that only 8-bit quantization is supported. This should be the first thing you read when going through the "Modifying quantization parameters" section of the comprehensive guide.
We'll clarify the documentation so that future users have a better experience. To make changes that would have prevented you from running into this, we need to understand how you went through and interpreted the different parts of the docs.
There are two related pieces of documentation:
Did you come across these sentences beforehand? This is to see whether we should make them more discoverable, since, as you said, there is nothing explicit inside "Modifying quantization parameters" (and the other "Experiment with quantization" subsections) and the sentences are far apart from each other.
If you did come across them, was it that "there is no supported path to deployment" is too ambiguous? Would changing it to something like "there is no supported path to deployment, including conversion to a quantized model and running inference on any backend" have helped? Would a reference to "e.g. number of bits" like the overview page help?
Thanks for the suggestions.
> If you did come across them, was it that "there is no supported path to deployment" is too ambiguous? Would changing it to something like "there is no supported path to deployment, including conversion to a quantized model and running inference on any backend" have helped? Would a reference to "e.g. number of bits" like the overview page help?
Without prior knowledge that TFLite only supports the deployment of 8-bit integer models, "there is no path to deployment" is ambiguous, and I definitely interpreted it as "you just need to provide your alternate QuantizeConfig", as the example seems to imply.
Also, according to research papers and my own experience with quantization, post-training quantization already gives very good results when targeting 8-bit. It is only when you want to target 4-bit or lower that you have to rely on QAT, because otherwise the accuracy drop is too high: it is therefore very likely that people will come to TFMOT QAT with low-bitwidth support expectations.
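(For comparison, the 8-bit path that does work end to end is just the standard post-training quantization recipe; the model and calibration data below are placeholders:)

```python
import tensorflow as tf

# Placeholder model: substitute your own trained float Keras model here.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

def representative_dataset():
    # Placeholder calibration data: use real, preprocessed samples in practice.
    for _ in range(100):
        yield [tf.random.normal([1, 32, 32, 3])]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Optionally force full-integer (int8) inputs and outputs as well.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
```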
> @alessandroaimar - Thanks for your feedback. Your concern is valid that the default_8bit transforms get applied to all models.
> However, don't worry: we have a planned modification for this. The initial version of the library hasn't fully exposed the APIs that will be used for experimentation. The goal of the initial version was to provide a default path that works, and to have some experimental support for custom modifications.
> We'll be providing an API modification which allows users to override both the QuantizeLayoutTransform and the QuantizeRegistry for their application code. So this will completely remove those issues. The default transforms will only get applied for the default cases :)
Thank you very much for your answer. Is there an ETA for such changes?
Let's make the doc clear and close the issue.
After reading the quantization aware training documentation, I still have the following questions and would appreciate it if anyone could help.
Is it 8-bit floating point or 8-bit integer? What does it mean that quantization aware training doesn't change the trained weights? If I load my trained model, quantize it, and then train it, what happens? After training a quantized model, my inference speed is lower, so what is the purpose of it?
Thank you.
In file model-optimization/tensorflow_model_optimization/python/core/quantization/keras/quantize.py, at line 407 (function quantize_apply(model)), the model is transformed using the default_8bit_quantize_layout_transform that is necessary for the Edge TPU. However, this can interfere with custom quantization, since there is no obvious way to avoid it and the only way to know what is happening is to actually open the source code.
Note that this may effectively be a serious bug invalidating all the work done using this API: all quantized models, independently of their annotation, get transformed using the default transform defined in tensorflow_model_optimization/python/core/quantization/keras/default_8bit/default_8bit_quantize_layout_transform.py.
This involves unwanted changes to several layers, the most critical being the input layer (which gets quantized with an undocumented MovingAverageQuantizer) and the BatchNorm layers (which get fused with the preceding conv layers). Virtually every research paper written using this API has to be withdrawn or amended.
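(A quick way to see the effect described above is to quantize a tiny Conv + BatchNorm model with the stock API and look at which layers remain; a reproduction sketch, with the exact wrapped layer names depending on the TFMOT version:)

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Tiny Conv2D + BatchNormalization model, purely for inspection.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, input_shape=(32, 32, 3)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.ReLU(),
])

annotated = tfmot.quantization.keras.quantize_annotate_model(model)
quant_model = tfmot.quantization.keras.quantize_apply(annotated)

# With the default 8-bit layout transform applied, the BatchNormalization
# layer is expected to no longer appear as a separate layer, because it has
# been folded into the convolution by the hard-wired transforms.
for layer in quant_model.layers:
    print(layer.name, type(layer).__name__)
```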