google-ai-edge / LiteRT

LiteRT is the new name for TensorFlow Lite (TFLite). While the name is new, it's still the same trusted, high-performance runtime for on-device AI, now with an expanded vision.
https://ai.google.dev/edge/litert
Apache License 2.0

The constant folding pass of the TFLite converter prevents storing packed tensors and stores dequantized tensors instead #60

Open gaikwadrahul8 opened 4 days ago

gaikwadrahul8 commented 4 days ago

1. System information

WSL Linux 5.14.0-427.18.1.el9_4.x86_64 GNU/Linux; tensorflow==2.10.1 and tensorflow-cpu==2.10.1 installed using pip

2. Code

Ignore the fact that the dequantization process is currently wrong; this is just for testing.

import tensorflow as tf
from tensorflow.keras.layers import Layer


class TestBinary(Layer):
    def __init__(self, units):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        self.input_size = input_shape[-1]
        assert input_shape[-1] * self.units % 8 == 0
        compressed_size = int(input_shape[-1] * self.units / 8)
        self.kernel = self.add_weight(shape=(compressed_size,), initializer="ones", dtype=tf.int8)

    def call(self, x):
        compressed_weights = self.kernel
        tensors = []
        for i in range(8):
            tensors.append(compressed_weights)  # will use a formula to dequantize in the future

        kernel = tf.stack(tensors)
        kernel = tf.cast(kernel, tf.float32)  # seems to store the variable at this point on disk
        kernel = tf.reshape(kernel, (self.input_size, self.units))
        x = x @ kernel
        return x

quantDense = TestBinary(1000)
test_input_shape = (500,)
quantDense(tf.ones((1, *test_input_shape))) # initialize the weights

converter = tf.lite.TFLiteConverter.from_keras_model(quantDense)
converted = converter.convert()

with open("test.tflite", "wb") as f:
    f.write(converted)

3. Conversion

The conversion is successful, but the model size is that of a model storing the full 1000*500 weight matrix in int8 format, which is about 500 KB, when it should store only the packed weights and weigh ~63 KB (rough arithmetic below).
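For reference, a minimal sketch of that size arithmetic (assuming the 8x packing from the layer above), plus a check of the actual file written earlier:

import os

input_size, units = 500, 1000
unpacked_int8_bytes = input_size * units    # 500,000 B ~= 500 KB (what I observe)
packed_bytes = input_size * units // 8      # 62,500 B  ~= 63 KB  (what I expect)
print(unpacked_int8_bytes, packed_bytes, os.path.getsize("test.tflite"))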

I assume this is the result of the constant folding pass of the converter, which stores the weights that have just been cast to float32 instead of storing the int8 weights and re-doing the dequantization process each time. This can be seen in the graph of the resulting tflite file (screenshot attached). (I am not sure why the tensor is not saved after the transpose instead.)

This is also evidenced by the fact that I can replace compressed_weights = self.kernel with compressed_weights = self.kernel + tf.cast(x[0,0], dtype=tf.int8) * 0 (sketched below) and have the compressed weights saved on disk that way, because x cannot be constant-folded. However, this costs extra operations and forces me to activate additional supported ops in the converter, which is not ideal.
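A minimal sketch of that workaround applied to the call method from section 2 (only the first line changes; the rest of the class is unchanged):

    def call(self, x):
        # Tie the packed kernel to the input so the converter cannot
        # constant-fold the cast/reshape chain; numerically this adds zero.
        compressed_weights = self.kernel + tf.cast(x[0, 0], dtype=tf.int8) * 0
        tensors = [compressed_weights for _ in range(8)]
        kernel = tf.cast(tf.stack(tensors), tf.float32)
        kernel = tf.reshape(kernel, (self.input_size, self.units))
        return x @ kernel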

Note that I also tried adding this code, but it does not change anything:

tf.config.optimizer.set_experimental_options({
    "constant_folding": False,
    "disable_model_pruning": True,
    "remapping": False,
})

So, is there a way to prevent constant folding? Perhaps with a global flag, but preferably by introducing a no-op in the graph at a specific point so that only these nodes are kept from being folded.

Maybe there is also a way to guarantee the storage of a particular parameter in packed form for binary and ternary weights?
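For context, this is roughly the kind of bit-unpacking dequantization I have in mind for the packed binary weights (the helper name and bit layout here are only an illustration, not the final formula):

def unpack_binary(packed, input_size, units):
    # packed: int8 tensor of shape (input_size * units // 8,), 8 binary weights per byte
    bits = []
    for i in range(8):
        shift = tf.constant(i, dtype=tf.int8)
        bit = tf.bitwise.bitwise_and(
            tf.bitwise.right_shift(packed, shift),
            tf.constant(1, dtype=tf.int8))  # extract bit i of every byte
        bits.append(bit)
    kernel = tf.cast(tf.stack(bits), tf.float32) * 2.0 - 1.0  # map {0, 1} -> {-1, +1}
    return tf.reshape(kernel, (input_size, units))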

gaikwadrahul8 commented 2 days ago

This issue, originally reported by @BenCrulis, has been moved to this dedicated repository for LiteRT to enhance issue tracking and prioritization. To ensure continuity, we have created this new issue on your behalf.

We appreciate your understanding and look forward to your continued involvement.