huawei-noah / Pretrained-IPT

Apache License 2.0

Questions about position encoding for a larger sequence length #5

Open pp00704831 opened 3 years ago

pp00704831 commented 3 years ago

Hello, in your paper you crop an image into 48x48 patches with 3 channels, which are then fed through the heads. Before the features enter the transformer, each feature map is split into patches with kernel size 3, each treated as a word, producing a 16x16 sequence of tokens.
How do you handle the position encoding if we input a larger sequence, such as 32x32 tokens?

Looking forward to your reply, thank you!

HantingChen commented 3 years ago

What do you mean by inputting a larger sequence for the position encoding?

Since the position encoding is added to the input patches, its size must exactly match the patches (16*16).

pp00704831 commented 3 years ago

Hello,

For the image deblurring task, you use a patch size of 256x256 with patch dim 8, so the number of tokens is 32x32. But your pre-trained model was trained on size 48x48 with patch dim 3, so the number of tokens is 16x16. This seems to create a mismatch in the position encoding with the pre-trained model. Do you interpolate from 16x16 to 32x32?

Thank you!
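The interpolation asked about above can be sketched as follows. This is a minimal sketch of resizing a learned 2D position-embedding table from a 16x16 to a 32x32 token grid, as is commonly done for ViT-style models; the function name and tensor layout here are illustrative assumptions, not taken from the IPT code.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """Resize (1, old_grid*old_grid, dim) position embeddings to (1, new_grid*new_grid, dim).

    Hypothetical helper: IPT may handle the mismatch differently.
    """
    _, n, dim = pos_embed.shape
    assert n == old_grid * old_grid
    # (1, N, dim) -> (1, dim, H, W) so spatial interpolation can be applied
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    # back to (1, N', dim)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

pe = torch.randn(1, 16 * 16, 64)         # pretrained: 16x16 tokens, dim 64
pe_large = resize_pos_embed(pe, 16, 32)  # fine-tune: 32x32 tokens
print(pe_large.shape)                    # torch.Size([1, 1024, 64])
```

Bicubic interpolation treats the embedding table as a small per-channel image, which preserves the smooth spatial structure that learned position embeddings typically develop during pretraining.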