plemeri / InSPyReNet

Official PyTorch implementation of Revisiting Image Pyramid Structure for High Resolution Salient Object Detection (ACCV 2022)

`base_size` and `stage` parameters are not used for encoder and decoder #35

Closed: kuynzereb closed this issue 9 months ago

kuynzereb commented 10 months ago

Hi, thank you for the great work!

I was digging into the network architecture and found out that the `base_size` and `stage` parameters are not actually used inside the `PAA_e` and `PAA_d` modules. They only initialize the `stage_size` parameter, which is then passed on to the `SelfAttention` module, and `SelfAttention` does not actually use this parameter either.
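Roughly, the wiring I mean looks like this (a simplified sketch of my understanding, not the actual repo code; the real signatures may differ):

```python
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, in_channels, mode='hw', stage_size=None):
        super().__init__()
        self.mode = mode
        self.stage_size = stage_size  # stored here, but never read again

    def forward(self, x):
        # ... attention computation; self.stage_size is not referenced here
        return x

class PAA_e(nn.Module):
    def __init__(self, in_channel, out_channel, base_size=None, stage=None):
        super().__init__()
        # base_size and stage only serve to compute stage_size ...
        if base_size is not None and stage is not None:
            stage_size = (base_size[0] // (2 ** stage), base_size[1] // (2 ** stage))
        else:
            stage_size = None
        # ... which SelfAttention then stores but never uses
        self.attn = SelfAttention(out_channel, 'hw', stage_size)
```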

I am just wondering: is this some sort of a bug, meaning the model could perhaps be improved further, or should these parameters simply not be used in these modules?

kuynzereb commented 10 months ago

`SelfAttention` itself also looks suspicious. If the input tensor is square (i.e., H equals W), then `mode='h'` and `mode='w'` will produce exactly the same output. These two modes are used in the `PAA_kernel` and `PAA_d` modules, so I assume they are expected to produce different outputs.
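A quick standalone check of what I mean (a toy example, not code from the repo):

```python
import torch

B, C, H, W = 1, 2, 4, 4  # square feature map: H == W
x = torch.arange(B * C * H * W, dtype=torch.float32).view(B, C, H, W)

# The reshapes that mode='h' and mode='w' currently boil down to:
x_h = x.view(B, -1, H)  # axis = height
x_w = x.view(B, -1, W)  # axis = width

# Since view() only reinterprets the memory layout, the two tensors are
# element-for-element identical whenever H == W.
print(torch.equal(x_h, x_w))  # True
```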

Amadeus-AI commented 10 months ago

Interesting. If you have time, I suggest digging into https://github.com/plemeri/UACANet, which is where `PAA_e` and `PAA_d` come from.

plemeri commented 10 months ago

Hi, thanks for paying attention to our work. We initially tried to solve an image resolution problem for the Object Contextual Representation architecture, which we had been using for the attention-based segmentation decoder, and thought it might be useful for other attention operations such as the PAA modules, but empirically we found that it wasn't necessary. Those parameters are deprecated and no longer in use.

kuynzereb commented 10 months ago

Got it, thank you!

The question about the `mode` parameter in `SelfAttention` still remains, though. It doesn't seem correct that `mode="h"` and `mode="w"` produce the same results for square images. It looks like the `mode` parameter is used incorrectly.

Currently, the transformation applied to the feature tensors looks like this:

```python
# Inside SelfAttention: choose the flattening axis based on self.mode
axis = 1
if "h" in self.mode:
    axis *= height
if "w" in self.mode:
    axis *= width

view = (batch_size, -1, axis)

x = x.view(*view)
```

This is correct for `mode="w"` but seems to be incorrect for `mode="h"`. If I understand correctly, in that case it should instead be something like:

```python
x = x.permute(0, 1, 3, 2).reshape(*view)  # view = (batch_size, -1, height)
```

That is, we should additionally transpose the H and W dimensions.
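For illustration, here is a standalone comparison of the two variants (again a toy example, not the repo code):

```python
import torch

B, C, H, W = 1, 2, 4, 4
x = torch.arange(B * C * H * W, dtype=torch.float32).view(B, C, H, W)
view = (B, -1, H)

# Current behavior for mode="h": flattens along W, not H.
current = x.view(*view)

# Proposed fix: swap H and W first so the last axis really runs over height.
fixed = x.permute(0, 1, 3, 2).reshape(*view)

print(torch.equal(current, fixed))  # False: the fix groups elements differently
```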

plemeri commented 9 months ago

You are once again right about the problem. This actually came from our previous work UACANet (issue #8), but we did not find any meaningful performance difference at the time, so we didn't change it.

If you're doing research or developing your own work, you can either fix it in the source code or just leave it as is.

kuynzereb commented 9 months ago

Ok, thanks!