w1oves / Rein

[CVPR 2024] Official implementation of "Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation"
https://zxwei.site/rein
GNU General Public License v3.0

Confusion about the details of the paper #43

Closed: seabearlmx closed this issue 3 months ago

seabearlmx commented 3 months ago

Dear author, I have some confusion about the details of the paper as follows:

  1. The paper states that Rein can apply a softmax function to align each patch with a unique instance. Why can a softmax function achieve this goal?
  2. The paper also states that this strategic selection allows models to sidestep unnecessary adjustments by assigning a high value to the first token and subsequently discarding it. I am confused about why the first token is assigned a high value, and how the model is made to assign a high value to the first token rather than to another token.
w1oves commented 3 months ago
  1. I apologize for any confusion. It may be inaccurate to say that softmax aligns each patch with a unique instance; it is more accurate to say that softmax ensures each patch does not link strongly to multiple instances. As explained in Section 3.3, each token is linked to an instance, and softmax suppresses the non-maximum similarity values.
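A minimal sketch of this effect (not the repository code; shapes and names are illustrative, and the scaled dot-product form of the similarity is assumed):

```python
import torch
import torch.nn.functional as F

n_patches, n_tokens, c = 4, 3, 16    # illustrative sizes
feats = torch.randn(n_patches, c)    # patch features F
tokens = torch.randn(n_tokens, c)    # learnable tokens T

# Patch-token similarity: one row per patch, one column per token.
S = feats @ tokens.t() / c ** 0.5

# Softmax along the token dimension amplifies each row's maximum
# relative to the other entries, so a patch is dominated by one token
# (one instance) rather than linking strongly to several at once.
S = F.softmax(S, dim=-1)
print(S.sum(dim=-1))  # each row sums to 1
```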
w1oves commented 3 months ago
  2. Removing any column achieves a result similar to removing the first column; I chose the first column simply because it was the easiest option. The key is to remove one column so that each row of the similarity matrix S no longer sums to 1, as sketched below. This idea was inspired by "Vision Transformers Need Registers". If you have any further questions, please feel free to ask.
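A minimal sketch of the effect of dropping a column (again with illustrative names, not the repository code): after the first column is discarded, a patch can route most of its softmax mass to that column, leaving only small similarities to the remaining tokens and thus sidestepping adjustment:

```python
import torch
import torch.nn.functional as F

S = F.softmax(torch.randn(4, 3), dim=-1)  # rows sum to 1 after softmax

S_trimmed = S[:, 1:]          # discard the first column
print(S_trimmed.sum(dim=-1))  # each row now sums to less than 1

# If a patch assigns a high softmax value to the discarded column, its
# remaining similarities are all small, so it is effectively left
# unadjusted by every token.
```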
seabearlmx commented 2 months ago

Thanks for your reply. The results in Table 3 caught my attention, since some of the Noise results on Cityscapes to Cityscapes-C are lower than HGFormer's. Could you discuss the reason for this phenomenon?

Moreover, how are the learnable tokens optimized so that they link to each instance? Is this goal achieved by substituting the learnable tokens for the original tokens when predicting masks? Does it make sense to improve the optimization or learning strategy for the learnable tokens?

w1oves commented 1 month ago