tensorflow / tflite-support

TFLite Support is a toolkit that helps users develop ML and deploy TFLite models onto mobile / IoT devices.
Apache License 2.0
370 stars 124 forks

TF Lite NNAPI delegate doesn't work with tf.keras.layers.CenterCrop layer #719

Open · markorakita opened this issue 2 years ago

markorakita commented 2 years ago

Not sure if this is the correct place to report this problem, or whether I should report it in the tensorflow/tensorflow repository.

As said here, using Keras preprocessing layers as part of our model is recommended because:

Data augmentation will run on-device, synchronously with the rest of your layers, and benefit from GPU acceleration.

When you export your model using model.save, the preprocessing layers will be saved along with the rest of your model. If you later deploy this model, it will automatically standardize images (according to the configuration of your layers). This can save you from the effort of having to reimplement that logic server-side.

Unfortunately, if you add the tf.keras.layers.CenterCrop preprocessing layer to your model, you will get this error when you try to run the converted TFLite model using the NNAPI delegate: Internal error: Failed to apply delegate: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors.

I don't see why the CenterCrop layer would cause dynamic-sized tensors if we explicitly specify input_shape; it seems like a bug. Other Keras preprocessing layers like tf.keras.layers.Rescaling work just fine.

lu-wang-g commented 2 years ago

@karimnosseir could you please take a look?

srjoglekar246 commented 2 years ago

Hey Lu, can you assign this to me?

srjoglekar246 commented 2 years ago

Hey @markorakita , first off, are you quantizing the model? If not quantized, the NNAPI delegate might not provide much benefit.

Second, can you provide your model (even if untrained, it's fine) & how you are using the CenterCrop layer in your code? That will help us dig into why these dynamic shapes occur in your model.

markorakita commented 2 years ago

> Hey @markorakita , first off, are you quantizing the model? If not quantized, the NNAPI delegate might not provide much benefit.

Hi @srjoglekar246 , thank you for looking into this. I haven't tried quantizing the model yet; I got this exception with the unquantized model. Could you clarify this, please? Everywhere in the documentation, quantization is mentioned only as an extra step to additionally reduce model size and sometimes inference time, but I thought that every model (with supported ops) would benefit from being run on mobile GPU/DSP/NPU units.
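
(If I do end up quantizing, I assume the post-training path would look roughly like the sketch below, reusing the model and dataset objects from my training code further down - I haven't actually run this yet.)

# Rough sketch of post-training quantization (unverified); `model` and
# `dataset` refer to the objects from my training code below.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_data_gen():
    # Feed a few real batches so the converter can calibrate activation ranges.
    for images, _ in dataset.take(100):
        yield [tf.cast(images, tf.float32)]

converter.representative_dataset = representative_data_gen
quantized_tflite_model = converter.convert()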

Btw I'm sorry to say this, and this is probably not the right place for this feedback, but the documentation around delegates is very poor and confusing. TensorFlow is a high-level API, but then all of a sudden we are left with this low-level hardware decision of choosing which delegate to use on which device. And to make things worse, after reading all the documentation it is still not clear to me how to make that choice. I feel like, having already implemented the delegates, you know better than we do which delegate will work better on which underlying hardware. It would be great for us users to just have an option to set "automaticallyChooseDelegate" to true and not worry about it :)

> Second, can you provide your model (even if untrained, it's fine) & how you are using the CenterCrop layer in your code? That will help us dig into why these dynamic shapes occur in your model.

Sure, here is the code I used to train the model:

image_size = (180, 180)
cropped_image_size = (160, 160)
num_channels = 3
batch_size = 32

dataset = create_dataset(train_data_path, image_size, num_channels, batch_size)

model = tf.keras.models.Sequential([
        tf.keras.layers.CenterCrop(
            height=cropped_image_size[0],
            width=cropped_image_size[1],
            input_shape=(image_size[0], image_size[1], num_channels)),
        tf.keras.layers.Rescaling(
            scale=1.0 / 255,
            input_shape=(cropped_image_size[0], cropped_image_size[1], num_channels)),
        tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(5)
    ])

model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])

model.fit(
    dataset,
    epochs=10
)

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
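
(In case it helps narrow down where the dynamic shape comes from, this is roughly how I'd inspect the converted model's input details - just a sketch, I haven't verified it:)

# Sketch: look for dynamic dimensions in the converted model (unverified).
interpreter = tf.lite.Interpreter(model_content=tflite_model)
for detail in interpreter.get_input_details():
    # A -1 in 'shape_signature' would indicate a dynamic dimension.
    print(detail['name'], detail['shape'], detail.get('shape_signature'))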

The create_dataset function just loads the flowers dataset from disk, as in this tutorial. I didn't include it above since it is probably irrelevant.

As for the trained model, would you like me to upload the trained Keras model or the converted TFLite model?

karimnosseir commented 2 years ago

> Hey @markorakita , first off, are you quantizing the model? If not quantized, the NNAPI delegate might not provide much benefit.

> Hi @srjoglekar246 , thank you for looking into this. I haven't tried quantizing the model yet; I got this exception with the unquantized model. Could you clarify this, please? Everywhere in the documentation, quantization is mentioned only as an extra step to additionally reduce model size and sometimes inference time, but I thought that every model (with supported ops) would benefit from being run on mobile GPU/DSP/NPU units.

> Btw I'm sorry to say this, and this is probably not the right place for this feedback, but the documentation around delegates is very poor and confusing. TensorFlow is a high-level API, but then all of a sudden we are left with this low-level hardware decision of choosing which delegate to use on which device. And to make things worse, after reading all the documentation it is still not clear to me how to make that choice. I feel like, having already implemented the delegates, you know better than we do which delegate will work better on which underlying hardware. It would be great for us users to just have an option to set "automaticallyChooseDelegate" to true and not worry about it :)

Thanks a lot for the feedback, this is very useful (please keep sending us feedback). May I ask you to expand on which part of the documentation you are referring to? There is documentation for each delegate and documentation on writing your own delegate.

Please keep the feedback coming :) Thanks

srjoglekar246 commented 2 years ago

> I feel like, having already implemented the delegates, you know better than we do which delegate will work better on which underlying hardware. It would be great for us users to just have an option to set "automaticallyChooseDelegate" to true and not worry about it :)

It's... not quite as simple as that :-). This section of our documentation explains how you can choose which delegate to use based on your model & hardware, but empirically some delegates are better than others in certain situations - e.g., while NNAPI does support floating point, the GPU delegate usually has better performance for fp32 models on Android devices.

Even if you decide to use a delegate, there is a tradeoff that the user needs to understand in terms of accuracy vs. performance. Because of how delegates perform internal computations, some models (like computational photography models) see an unacceptable drop in model quality if they use a delegate with certain parameters.

Moreover, "will a delegate be supported on this device" isn't a question that can be answered with 100% certainty. We have seen the GPU delegate fail on a few random devices due to libraries not being available, permissions not being granted to the application by the environment, etc.

And lastly, a user may not want to bundle all delegates into an application & leave the choice to the runtime every single time - each delegate has a corresponding binary size implication, and users don't want to add a few MBs to their APK just for a model.

That being said, we are exploring some on-device benchmarking solutions for internal apps - once we get them to a good state, we will likely offer them to external developers. Even then, it's not an exact science, and likely will not be in the near future, given the heterogeneity of Android devices.

Ultimately, I do sympathize with your position :-). Hardware acceleration is a tricky thing to get right, and we have to do a better job of abstracting away the ecosystem for end users such as yourself.

> but I thought that every model (with supported ops) would benefit from being run on mobile GPU/DSP/NPU units.

(Just as a quick note) This is not always true. If the number of supported operations in your model is low, the cost of copying data between the CPU and the GPU might actually be higher than the benefit of accelerating those operations.

About your model...

From the source code of tf.keras.layers.CenterCrop, I could not see how your input_shape kwarg is being used by the layer. Also, for fully static shapes, TFLite usually requires the Keras batch dimension to also be defined (usually set to 1). Otherwise, a batch dim of -1 causes dynamic shapes in the end model.

As for the model, you can share the TFLite model. But it might be worth specifying the batch dim as well, ensuring that the input shape is indeed well-defined for Keras & re-converting the model.
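
As a rough sketch (untested on my end, and assuming your 180x180x3 input from the code above), one way to pin the batch dimension is to convert from a concrete function with a fully static input signature:

# Sketch: convert via a concrete function whose input signature fixes every
# dimension, including the batch dimension, to a static size.
run_model = tf.function(lambda x: model(x))
concrete_func = run_model.get_concrete_function(
    tf.TensorSpec([1, 180, 180, 3], tf.float32))
converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
tflite_model = converter.convert()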

markorakita commented 2 years ago

@karimnosseir @srjoglekar246 This part will be related to both of your responses:

I completely understand how complex the issue is, and that it can't be made into a cookie-cutter decision. I also absolutely love that you have left so much room for advanced users to customize and optimize how their models will be run on devices, I am all for it!

It is just that right now I am coming from a different position, and I can bet that many more people are too. My position is this -> I want to build a quick and dirty MVP for some idea ASAP, and I have only two requirements:

I feel like many users who want to try TFLite are coming from this position, and that there should be some Quick Start guide that explains how to accomplish this. I think that many people just don't have the time necessary to get a good grasp on the issue, and this complexity might turn them off; and then, if PyTorch Mobile has a better quick start guide, they might decide to use that product instead.

~

@karimnosseir This page explains all the types of delegates that you support and their differences. When I first read it, I thought: "Oh great, they support all these hardware optimizations, I just need to add all those delegates and I will have everything covered!". Then, after trial and error, I found out that you can only choose one specific delegate to use, which is not very clear from the documentation. It is also not clear from the API: why name the function "addDelegate" instead of "setDelegate" if you can only add one delegate?

Another thing that is not clear from the documentation, and that I had to find out on Stack Overflow, is: "If the model can't be run using the delegate, will execution fall back to the CPU or will it crash?"

Then here, for example, you have listed the ops supported by the GPU delegate, and there is a similar page for the NNAPI delegate. It is great that you have documented this, but I feel that only a small number of users can actually make use of this info. It would be so much better if you had some tool that could scan our model and tell us whether we are using any unsupported ops, and which ones.
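
(For what it's worth, the workflow I have in mind is something like the sketch below. I believe newer TF versions ship tf.lite.experimental.Analyzer with a gpu_compatibility flag that comes close, though I haven't tried it myself:)

# Sketch only (unverified): ask TF to analyze a converted model and flag ops
# that the GPU delegate cannot handle. Assumes a recent TF release that
# includes tf.lite.experimental.Analyzer.
tf.lite.experimental.Analyzer.analyze(
    model_content=tflite_model,
    gpu_compatibility=True)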

The documentation is so rich in low-level details, but I feel that many high-level questions like these aren't properly answered.

~

@srjoglekar246 I also sympathize with your position; as I said above, I understand how complex the issue is :) Nevertheless, I still think that there should be some "auto" option, no matter how naively it might be implemented in the background. Many users would rather opt to use that than implement this painful decision process themselves.

For example, I had an idea to do this:

> (Just as a quick note) This is not always true. If the number of supported operations in your model is low, the cost of copying data between the CPU and the GPU might actually be higher than the benefit of accelerating those operations.

Got it, that's why I said "(with supported ops)".

> From the source code of tf.keras.layers.CenterCrop, I could not see how your input_shape kwarg is being used by the layer.

Can you please elaborate on this? I am setting input_shape on the layers to ensure that the shape will not be dynamically deduced, or at least I thought that's what it would accomplish.

> Also, for fully static shapes, TFLite usually requires the Keras batch dimension to also be defined (usually set to 1). Otherwise, a batch dim of -1 causes dynamic shapes in the end model.

I will try to set a batch dimension, but I have never done that before and all my models could be run with the NNAPI delegate. It is only when I add the CenterCrop layer that I get this error.

srjoglekar246 commented 2 years ago

> Can you please elaborate on this? I am setting input_shape on the layers to ensure that the shape will not be dynamically deduced, or at least I thought that's what it would accomplish.

Can you point me to the docs saying that input_shape does that? I am not very familiar with Keras APIs, so I might be missing something :-)