shariqfarooq123 / AdaBins

Official implementation of AdaBins: Depth Estimation using Adaptive Bins
GNU General Public License v3.0

Tokens used for regression & queries #69

Open Divadi opened 2 years ago

Divadi commented 2 years ago

I noticed that after the transformer layer, AdaBins directly uses the first output token for the bin predictions and the following 128 tokens as queries. If I am not mistaken, these 129 tokens correspond to actual locations in the image (since the tokens are flattened image features). I was wondering why you chose to have such an overlap, instead of perhaps using 129 specially-defined tokens.
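For reference, a minimal sketch of the token usage being asked about (hypothetical names and assumed sizes, not the actual repository code): flattened patch features pass through a transformer encoder, the first output token regresses the bins, and the next 128 output tokens act as query kernels that are dot-producted with the pixel features.

```python
import torch
import torch.nn as nn

class TokenSplitSketch(nn.Module):
    """Illustrative sketch only: first output token -> bins, next 128 -> queries."""

    def __init__(self, embed_dim=128, n_bins=256, n_queries=128, patch_size=16):
        super().__init__()
        # Patchify the feature map into a token sequence (assumes C == embed_dim)
        self.patch_embed = nn.Conv2d(embed_dim, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.bin_head = nn.Sequential(nn.Linear(embed_dim, 256),
                                      nn.LeakyReLU(),
                                      nn.Linear(256, n_bins))
        self.n_queries = n_queries

    def forward(self, x):                    # x: (B, C, H, W) decoder features
        tokens = self.patch_embed(x)         # (B, C, h, w); h*w must exceed 1 + n_queries
        tokens = tokens.flatten(2).permute(2, 0, 1)  # (S, B, C) patch-token sequence
        out = self.transformer(tokens)       # (S, B, C) globally mixed tokens
        bin_widths = torch.softmax(self.bin_head(out[0]), dim=1)  # first token -> bins
        queries = out[1:1 + self.n_queries]  # next 128 tokens -> query kernels
        # Dot-product each query with every pixel feature (range-attention maps)
        attn = torch.einsum('qbc,bchw->bqhw', queries, x)
        return bin_widths, attn
```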

shariqfarooq123 commented 1 year ago

Hi Divadi,

Thanks for your interest in our work. This is an interesting question.

Since the self-attention layers combine information globally (from all tokens to all tokens), there is actually no natural correspondence between the input tokens and the output tokens of the transformer. Technically, you can interpret the tokens after the first layer however you like; their meaning is determined only by the loss and by how they are used downstream. (Note: the first layer's output retains a weak "query"-based correspondence, since each attention query is predicted from its corresponding input token.)
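A tiny PyTorch demonstration of this point (illustrative only): each output position of self-attention is a weighted combination of all input positions, so output token i has no privileged tie to input token i.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4)
x = torch.randn(10, 1, 32)       # (seq_len, batch, dim)
out, weights = attn(x, x, x)     # self-attention: q = k = v = x
print(weights.shape)             # (1, 10, 10): one attention row per output token
print(weights[0, 0])             # output token 0 draws from ALL 10 input tokens
```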

If you have a reason to distinguish between the patch tokens and one token that represents global information, you can simply prepend a dummy [CLS] token and use that as the "global" token instead. This is the case, for example, in language models, where you really do need a direct correspondence between input and output tokens. In our case, however, we don't need any such correspondence, so the transformer is free to pool any kind of information into any of the tokens, without discrimination or any bias towards the corresponding patches.
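A sketch of that [CLS] alternative (again illustrative, not the AdaBins code): a learned token with no patch of its own is prepended, and the "global" summary is read from its output position.

```python
import torch
import torch.nn as nn

class WithClsToken(nn.Module):
    def __init__(self, embed_dim=128, depth=4, nhead=4):
        super().__init__()
        # One learned dummy token, shared across the batch
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=nhead)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens):          # (S, B, C) flattened patch features
        S, B, C = patch_tokens.shape
        cls = self.cls_token.expand(1, B, C)  # broadcast the dummy token over the batch
        out = self.transformer(torch.cat([cls, patch_tokens], dim=0))
        return out[0], out[1:]                # global summary, per-patch outputs
```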

With that said, adding 129 extra dummy input tokens might actually improve performance slightly, purely because of the longer sequence length and the extra representational capacity, but it would also increase the memory footprint.
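To make the memory point concrete, a back-of-envelope comparison (assumed sizes, not measured): the attention weight maps scale quadratically in sequence length, so growing the sequence by 129 tokens grows them more than linearly.

```python
# Hypothetical numbers: a 15x16 grid of patch tokens, 4 heads, batch of 8
S, heads, batch = 240, 4, 8
for n in (S, S + 129):
    floats = batch * heads * n * n           # one (n x n) attention map per head
    print(f"seq={n}: {floats * 4 / 2**20:.1f} MiB per attention layer (fp32)")
```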

Hope that gives you some insight into the design choice in question.