This issue originally reported by @BenCrulis has been moved to this dedicated repository for LiteRT to enhance issue tracking and prioritization. To ensure continuity, we have created this new issue on your behalf.
We appreciate your understanding and look forward to your continued involvement.
1. System information
- OS: WSL, Linux 5.14.0-427.18.1.el9_4.x86_64 GNU/Linux
- TensorFlow: tensorflow==2.10.1 / tensorflow-cpu==2.10.1, installed using pip
2. Code
Ignore the fact that the dequantization process is currently wrong; this is just for testing.
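The original code block did not survive the migration. A minimal sketch of the kind of layer being described, assuming a 1000×500 binary weight matrix stored packed as int8 (8 weights per byte, hence ~63 KB) and unpacked to float32 in `call`, might look like the following. The class name `PackedBinaryDense` and the exact unpacking scheme are assumptions, and the transpose mentioned later in the report is omitted here:

```python
import tensorflow as tf

# Hypothetical reconstruction: binary weights stored packed, 8 per int8 byte,
# unpacked to a float32 kernel at runtime.
class PackedBinaryDense(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        self.in_dim = int(input_shape[-1])
        packed_len = (self.in_dim * self.units + 7) // 8  # bytes needed
        self.kernel = self.add_weight(
            name="kernel",
            shape=(packed_len,),
            dtype=tf.int8,
            initializer=tf.zeros_initializer(),
            trainable=False,
        )

    def call(self, x):
        compressed_weights = self.kernel
        # Extract the 8 bits of every byte; sign extension does not matter
        # because each bit is masked with & 1.
        packed = tf.expand_dims(tf.cast(compressed_weights, tf.int32), -1)
        bits = tf.bitwise.bitwise_and(
            tf.bitwise.right_shift(packed, tf.range(8, dtype=tf.int32)), 1
        )
        bits = tf.reshape(bits, [-1])[: self.in_dim * self.units]
        # "Dequantize" {0, 1} -> {-1.0, +1.0}; knowingly simplified/wrong,
        # as noted above, since it only serves to exercise the converter.
        w = tf.cast(tf.reshape(bits, [self.in_dim, self.units]), tf.float32)
        return tf.matmul(x, w * 2.0 - 1.0)

inputs = tf.keras.Input(shape=(1000,))
outputs = PackedBinaryDense(500)(inputs)
model = tf.keras.Model(inputs, outputs)
```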
3. Conversion
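The conversion snippet was also lost in the migration; presumably it was the standard Keras conversion path, roughly:

```python
# Assumption: default TFLITE_BUILTINS op set, no extra converter flags.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open("packed_dense.tflite", "wb") as f:
    f.write(tflite_model)
print(f"model size: {len(tflite_model) / 1024:.0f} KB")
```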
The conversion succeeds, but the model size corresponds to storing the full 1000×500 weight matrix in int8 format, i.e. about 500 KB, when it should store only the packed weights and weigh ~63 KB.
I assume this is the result of the converter's constant-folding pass, which stores the weights that have just been cast to float32 instead of storing the int8 weights and redoing the dequantization each time. This can be seen in the graph of the resulting .tflite file. (I am not sure why the tensor is not saved after the transpose instead.)
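One way to confirm which tensor the converter baked into the file is to list the tensors of the converted model with the standard interpreter API; on the sketch above this would presumably show a float32 constant of shape (1000, 500) (or its transpose) rather than the packed int8 buffer:

```python
interpreter = tf.lite.Interpreter(model_path="packed_dense.tflite")
interpreter.allocate_tensors()
for t in interpreter.get_tensor_details():
    print(t["index"], t["name"], t["dtype"].__name__, t["shape"])
```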
This is also evidenced by the fact that I can replace

```python
compressed_weights = self.kernel
```

with

```python
compressed_weights = self.kernel + tf.cast(x[0, 0], dtype=tf.int8) * 0
```

and have the compressed weights saved on disk this way, because `x` cannot be constant-folded. However, this costs extra operations and forces me to activate additional supported ops in the converter, which is not ideal.

Note that I also tried adding this code, but it does not change anything:
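The snippet referenced here was lost in the migration. A plausible candidate, given the goal, is TensorFlow's documented Grappler switch for constant folding, which, judging by the report, has no effect on the converter's own folding pass (this is an assumption about what was actually tried):

```python
# Assumption: disabling Grappler's constant folding globally.
tf.config.optimizer.set_experimental_options({"constant_folding": False})
```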
So, is there a way to prevent constant folding? Perhaps with a global flag, but preferably by introducing a no-op into the graph at a specific point, so that only these nodes are protected from folding.
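For reference, the point-wise guard described above can be packaged as a small helper (hypothetical name `prevent_folding`); something with this granularity, applied only to the dequantization chain but without the runtime cost, is what is being asked for:

```python
def prevent_folding(tensor, runtime_input):
    # Make `tensor` data-dependent on a runtime input by adding a zero
    # derived from it, so the converter cannot treat it as a constant.
    return tensor + tf.cast(runtime_input[0, 0], tensor.dtype) * 0
```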
Maybe there is also a way to guarantee that a particular parameter is stored in packed form, for binary and ternary weights?