microsoft / Cream

This is a collection of our NAS and Vision Transformer work.

About The Position Information #210

Closed · Zhong1015 closed this issue 9 months ago

Zhong1015 commented 10 months ago

Hi @wkcn. Your work has been truly inspiring, and I have applied your modules to my model, yielding some promising results. However, I have a query. Suppose I have not partitioned the image into patches as in DeiT, nor implemented absolute position encoding within the Transformer as in DETR. Without any method for obtaining positional information, if the image features are flattened and fed into the Transformer encoder, how does your 2D positional encoding work, or does it function at all? Since your method is designed for 2D image features, how would the Euclidean distance or other distance calculations be computed without positional information?

wkcn commented 10 months ago

Hi @Zhong1015 ,

  1. how does your 2D positional encoding work?

RPE can capture local information like convolution, as discussed in Section 4.5 of our paper. Even without absolute position encoding, RPE still improves the performance. However, absolute position encoding is necessary for object detection, as shown in Table 6.

  2. how would Euclidean distance or other distance calculations be achieved without positional information?

Euclidean distance can also capture local information in images: nearby patches on the 2D grid have small distances, so a bias indexed by the distance acts as a local prior.
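For intuition, here is a minimal sketch (my own illustration, not code from this repo) of how pairwise Euclidean distances between patch positions can be derived from nothing but the 2D grid shape of the flattened feature map:

```python
import torch

# Illustration only: pairwise Euclidean distances between 2D patch
# coordinates, derived purely from the grid shape.
H, W = 4, 4  # hypothetical 4 x 4 grid of tokens

ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (H*W, 2)

diff = pos[:, None, :] - pos[None, :, :]   # (H*W, H*W, 2) relative offsets
dist = diff.norm(dim=-1)                   # (H*W, H*W) Euclidean distances

# Nearby patches have small distances, so a learnable bias indexed by
# (a bucketized version of) this distance acts as a local prior,
# similar to a convolution's receptive field.
print(dist[0])  # distances from the top-left token to all others
```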

Zhong1015 commented 10 months ago

Hi @wkcn. Thank you very much for responding to my question. Let me briefly explain my work, and I hope you can provide some advice. I first extract image features using a CNN-based backbone. These features are then flattened and fed into a traditional standard Transformer structure (not a vision Transformer). Since I am working on multi-label image classification, the input to the decoder consists of label embeddings, so I did not use your iRPE there. For the encoder, the input comprises the image features obtained after CNN convolution, but I did not apply any positional encoding. I then applied your iRPE on qkv. Is this approach reasonable? If it is, could you please explain how you perform two-dimensional computations on flattened data? Once again, thank you for your attention to my question. Your response is immensely helpful to me!

Zhong1015 commented 10 months ago

I also did not perform patch partitioning on the image as in DeiT.

wkcn commented 10 months ago

@Zhong1015

  1. the input in the decoder part consists of label embeddings, so I did not use your iRPE. For the encoder part

The decoder can also be integrated with iRPE. There is an argument named skip, which skips the label embedding so that it is excluded from the 2D grid.

Example: https://github.com/microsoft/Cream/blob/main/iRPE/DeiT-with-iRPE/rpe_models.py#L74
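For reference, a usage sketch of the configuration helper with the skip argument, following the example in the DeiT-with-iRPE README (please verify the exact signature against irpe.py in the repo):

```python
# Usage sketch following the DeiT-with-iRPE README; verify the exact
# argument names against irpe.py in the repo.
from irpe import get_rpe_config, build_rpe

rpe_config = get_rpe_config(
    ratio=1.9,         # bucket ratio controlling the number of RPE buckets
    method="product",  # how 2D offsets map to buckets: 'euc', 'quant', 'cross', 'product'
    mode='ctx',        # contextual mode
    shared_head=True,  # share the RPE parameters across attention heads
    skip=1,            # number of extra tokens (e.g. class/label embedding) to skip
    rpe_on='k',        # apply RPE on the keys
)

# build_rpe returns the RPE modules for queries, keys and values.
rpe_q, rpe_k, rpe_v = build_rpe(rpe_config, head_dim=32, num_heads=4)
```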

  2. For the encoder part, the input comprises image features obtained after CNN convolution, but I didn't apply any positional encoding. Subsequently, I applied your iRPE on qkv.

It is reasonable. A CNN implicitly injects relative and absolute positional information (e.g., through zero padding at the image borders), so in this case the Transformer need not be equipped with any (absolute or relative) positional encoding.
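A minimal sketch of the pipeline being discussed (hypothetical models and shapes, not code from this repo): CNN features are flattened into a token sequence and fed to a standard Transformer encoder with no explicit positional encoding.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Hypothetical pipeline: CNN backbone -> flatten -> standard Transformer
# encoder, with no explicit positional encoding added.
backbone = nn.Sequential(*list(resnet18(weights=None).children())[:-2])
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(1, 3, 224, 224)
feat = backbone(x)                        # (1, 512, 7, 7) for ResNet-18 at 224x224
tokens = feat.flatten(2).transpose(1, 2)  # (1, 49, 512): one token per spatial location
out = encoder(tokens)                     # zero padding inside the CNN already leaks
                                          # positional information into these tokens
```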

  3. how you perform two-dimensional computations on flattened data?

For a Transformer model like DeiT/ViT-16, the sequence length is 1 + 14 x 14 = 197, where 1 is the class embedding and 14 x 14 = 196 is the number of patch embeddings. We unflatten the patch embeddings to a 2D tensor with the shape 14 x 14, then compute the 2D RPE on it.

In the implementation of iRPE, we compute the 2D positional indices pos and the relative position diff = pos1 - pos2.

Code: https://github.com/microsoft/Cream/blob/main/iRPE/DeiT-with-iRPE/irpe.py#L342
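In simplified form, the index computation looks like the sketch below (my own illustration; the repo's full implementation is at the link above, and its 'product' method uses a piecewise bucket function rather than the plain clipping shown here):

```python
import torch

# Simplified sketch: 2D positional indices pos and relative offsets
# diff = pos1 - pos2, mapped to bucket ids for a learnable bias table.
H, W = 14, 14
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1)  # (196, 2) 2D indices

# Relative offset between every pair of tokens.
diff = pos[:, None, :] - pos[None, :, :]  # (196, 196, 2), values in [-13, 13]

beta = 4  # hypothetical clipping range, for illustration only
dy = diff[..., 0].clamp(-beta, beta) + beta   # shift into [0, 2*beta]
dx = diff[..., 1].clamp(-beta, beta) + beta
bucket = dy * (2 * beta + 1) + dx             # (196, 196) ids into a learnable table
```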

Zhong1015 commented 10 months ago

Hi @wkcn. Thank you for your reply. I have understood the three points you mentioned. However, regarding the second and third points, what I do not comprehend is this: you mentioned that after obtaining image features through CNN convolutions, I can directly apply iRPE on qkv in the Transformer. But I have not performed explicit patch partitioning or absolute positional encoding, so it should be impossible to calculate specific coordinates. Besides, I am using a standard Transformer structure, which does not provide absolute positional encoding information like DETR or perform image partitioning like DeiT. In such a scenario, can I still use iRPE to add relative positional encoding and carry out the corresponding computations?

Zhong1015 commented 10 months ago

For the first point: in the input section of the decoder, my qkv comprises learnable label embeddings rather than image information. Therefore, I believe iRPE (image Relative Position Encoding) is not necessary in this context.

wkcn commented 10 months ago

@Zhong1015 Yes, but I suggest adding the absolute positional encoding too.

Zhong1015 commented 10 months ago

  @Zhong1015 Yes, but I suggest adding the absolute positional encoding too.

Thank you for your reply. Are you suggesting that it is correct to flatten the image features obtained after convolution, without any partitioning or absolute positional encoding, and then pass these flattened features through a standard Transformer encoder while applying iRPE (image Relative Position Encoding) on the query, key, and value (qkv)?

wkcn commented 10 months ago

@Zhong1015

  1. The performance of the model with only iRPE is better than that of the one without any positional encoding.
  2. I am not sure how much impact the absolute positional encoding brings in this case, but I think absolute positional encoding is necessary. A minimal sketch of adding it is shown below.
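To make the last point concrete, here is a sketch (my illustration, not code from this repo) of adding a learnable absolute positional embedding to the flattened CNN tokens; the shapes follow the hypothetical ResNet example earlier in the thread:

```python
import torch
import torch.nn as nn

# Hypothetical: a learnable absolute positional embedding added to the
# flattened CNN tokens before the Transformer encoder.
num_tokens, dim = 49, 512  # e.g. a 7 x 7 feature map with 512 channels
abs_pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
nn.init.trunc_normal_(abs_pos, std=0.02)

tokens = torch.randn(1, num_tokens, dim)  # flattened CNN features
tokens = tokens + abs_pos                 # add absolute PE, then feed the encoder
```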