keras-team / keras-cv

Industry-strength Computer Vision workflows with Keras

AdaptiveAvgPool2D layer #512

Closed Rocketknight1 closed 1 year ago

Rocketknight1 commented 2 years ago

Short Description: At Hugging Face we've seen a few PyTorch vision transformer models using AdaptiveAvgPool2D. In a lot of cases these are just resizing to (1,) or (1, 1), in which case they're just a strange way to compute torch.mean(), but in some cases they're actually using this layer's full functionality.

The problem with AdaptiveAvgPool2D is that it computes the pooling windows in a unique way, and the size of the windows can be variable. This makes it impossible to implement with a standard pooling layer, and very annoying to port to TF, especially if you want to load weights from a model that was trained with the Torch layer. There is an implementation in tensorflow-addons but it uses fixed size windows, and so does not match the output of the Torch layer unless the input size is an integer multiple of the output size.
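For reference, a minimal sketch of the window rule the Torch layer applies per output index along each spatial dimension (the helper name here is ours, purely for illustration):

import math

def torch_style_window(i, in_size, out_size):
    # Window boundaries used by torch.nn.AdaptiveAvgPool2d for output index i.
    # The window width varies and windows can overlap unless in_size is an
    # integer multiple of out_size, which is why fixed-size windows can't
    # reproduce the Torch output in general.
    start = math.floor(i * in_size / out_size)
    end = math.ceil((i + 1) * in_size / out_size)
    return start, end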

We made a reasonably performant TF version that does correctly match the Torch layer in all cases - do you need this in keras-cv? We'd be happy to add it as a PR if it's useful.

Existing Implementations: Torch layer

Other information: See this StackOverflow post for a good description of how the Torch layer works internally.

LukeWood commented 2 years ago

Really interesting - is this layer required to get strong performance in vision transformer models?

I see no issue in including it, but I want to be sure it fits a use case. Let me know if you have more information as to "why" it is used.

Rocketknight1 commented 2 years ago

I suspect that the layer is never really necessary when you're designing a model from scratch - you could always use normal pooling layers and choose the strides and widths appropriately. The need for it in TF mostly arises when someone has trained a model in PyTorch and you want to reimplement their model and load their weights - if you write a slightly different pooling layer, you'll probably break compatibility.

It's commonly used in pyramid pooling modules (paper, >8000 citations), e.g. in BeiT (>200 citations, code sample) and Data2Vec. There is also a tensorflow-addons port of that module, but as it depends on the TFA implementation of adaptive pooling, results do not match Torch pyramid pooling modules.

Rocketknight1 commented 2 years ago

That said, it's totally okay if you want to leave it out for now - I linked it in the gist above, so feel free to just close this for now and add the layer later if/when you find out you need it for a model.

LukeWood commented 2 years ago

> That said, it's totally okay if you want to leave it out for now - I linked it in the gist above, so feel free to just close this for now and add the layer later if/when you find out you need it for a model.

I think this sounds like a good addition, I'm mainly just curious if the only reason anyone uses it is backwards compatibility. Do people continue to use it a lot in the pytorch world today?

LukeWood commented 2 years ago

Do I understand correctly that at train/inference this layer always uses the same stride/width sizes, and that the layer is just a way of auto-selecting these values?

Rocketknight1 commented 2 years ago

Yes, that's correct. However, it doesn't exactly have a 'stride' in the usual sense. It basically splats (potentially overlapping) pooling windows all across the input so as to get the desired output shape, but the spacing between these windows is usually not constant, unless the input size is an integer multiple of the output size. The width of the windows can also vary in different locations.
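
As a concrete illustration: with the floor/ceil boundary rule the Torch layer uses, an input of size 5 pooled down to size 3 gives windows of widths 2, 3 and 2, with the first two overlapping:

import math

in_size, out_size = 5, 3
windows = [(math.floor(i * in_size / out_size),
            math.ceil((i + 1) * in_size / out_size))
           for i in range(out_size)]
# windows == [(0, 2), (1, 4), (3, 5)] -> widths 2, 3, 2; the first two overlap at index 1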

LukeWood commented 2 years ago

I see, can you paste a snippet of the API we would want to use? Do you specify the output shape?

Rocketknight1 commented 2 years ago

Sure, the API is just:

import tensorflow as tf

layer = AdaptiveAvgPool2D(output_dims=(128, 128))

# Can also support NCHW, but we use NHWC here
inputs = tf.ones((8, 192, 192, 3), dtype=tf.float32)
outputs = layer(inputs)

# outputs.shape is (8, 128, 128, 3)

In other words, you specify the output shape at init, and then whatever Tensor you pass in gets pooled down to the desired output shape, with the required pooling windows being calculated in the call(). The same layer can handle multiple different input shapes without needing to be rebuilt.
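
For instance (still assuming the proposed layer above), the same instance can be reused on a differently sized input:

# Same (proposed) layer instance on a different input size; the pooling
# windows are recomputed in call(), so no rebuild is needed.
other = layer(tf.ones((8, 224, 224, 3), dtype=tf.float32))
# other.shape is (8, 128, 128, 3)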

LukeWood commented 2 years ago

Gotcha, that is a pretty interesting feature.

One last quick Q, does it handle upscaling too?

layer = AdaptiveAvgPool2D(output_dims=(128, 128))
# Can also support NCHW, but we use NHWC here
inputs = tf.ones((8, 192, 192, 3), dtype=tf.float32)
outputs = layer(inputs)
# outputs.shape is (8, 128, 128, 3)

upscaled = AdaptiveAvgPool2D(output_dims=(256, 256))(outputs)
# upscaled.shape is (8, 256, 256, 3)

Rocketknight1 commented 2 years ago

In Torch it might (though I don't think this is a common/intended use case), but because I implemented it using normal pooling layers in TF I don't think it would work in my implementation. If needed, I could add a different code path for upscaling, but I haven't seen any code in the wild where people use it for that.
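
A quick way to check the Torch behaviour directly (this exercises torch.nn.AdaptiveAvgPool2d, not the proposed TF layer) would be something like:

import torch

x = torch.ones(1, 3, 2, 2)
# If this runs without error, Torch accepts an output size larger than the input
y = torch.nn.AdaptiveAvgPool2d((4, 4))(x)
print(y.shape)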

LukeWood commented 2 years ago

for sure, thanks.

My only real concern here is that this isn't idiomatic to Keras. In pytorch you specify output features all the time, but in Keras that is computed for you. So it is a little strange to have this in Keras, BUT I will say for compatibility purposes it could be valuable.

Rocketknight1 commented 2 years ago

Yeah, I'd say it's mostly (entirely?) useful for PyTorch model compatibility, so I get that it might feel out of place. But still, let us know if you want it anyway, and we'll make a PR!

LukeWood commented 2 years ago

thanks for the offer!

I'd like to hear @fchollet and @tanzhenyu 's opinion on compatibility layers like this. Our long-term goal is to not port weights, but perhaps the cost here is low enough that the benefit outweighs it.

innat commented 2 years ago

@Rocketknight1 Did you also implement 1D and 3D versions of this layer?

Rocketknight1 commented 2 years ago

Hi @innat I could, yes! If you look at the gist I linked above, the pseudo_1d_pool function is basically just a 1D AdaptivePool, so that would be very easy to implement as a separate layer. To do a 3D pool I would just do a 1D pool on each of the 3 dimensions.
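
A rough sketch of what such a 1D adaptive pool could look like in TF (a simplified loop-based version, not the gist's pseudo_1d_pool; the function name is just for illustration):

import tensorflow as tf

def adaptive_avg_pool_1d(x, output_size, axis=1):
    # Pools x down to output_size along `axis`, using the same floor/ceil
    # window boundaries as the Torch layer. A 3D version could apply this
    # along the depth, height and width axes in turn.
    in_size = x.shape[axis]
    pooled = []
    for i in range(output_size):
        start = (i * in_size) // output_size
        end = -((-(i + 1) * in_size) // output_size)  # ceiling division
        window = tf.gather(x, tf.range(start, end), axis=axis)
        pooled.append(tf.reduce_mean(window, axis=axis, keepdims=True))
    return tf.concat(pooled, axis=axis)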

LukeWood commented 1 year ago

Closing this until we have a strong use case!

srasoulzadeh commented 3 months ago

AdaptiveAveragePooling layers (1D, 2D, 3D) used to exist in TensorFlow Addons. However, with TensorFlow Addons no longer supported, it would be beneficial to include them in KerasCV. For a strong use case, I can point to their application in tri-plane representations, which are increasingly used in the 3D generative AI literature. One example is the recent work by Wu, Learning to Generate 3D Shapes from a Single Example.