Closed SnaKey0u0 closed 10 months ago
Hi, thanks for the great work. I've been examining the prompt_encoder.py file and noticed that the img_pe and point_pe variables appear to be unused.
My understanding is that these should encode positional information via Fourier feature mapping, since they pass through the get_img_pe and _pe_encoding functions.
However, this information is not used in the self-attention and cross-attention of the TwoWayAttentionBlock.
Could you clarify whether I have overlooked something? I appreciate your assistance and look forward to your response.
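For context, SAM-style positional encodings map normalized coordinates through random Fourier features before sin/cos projection. A minimal sketch of what _pe_encoding computes (the function and matrix names here are illustrative, not the repo's actual code):

```python
import numpy as np

def fourier_pe(coords, gaussian_matrix):
    """Sketch of a SAM-style Fourier feature positional encoding.

    coords: (..., 2) coordinates normalized to [0, 1]
    gaussian_matrix: (2, num_feats) random Gaussian projection matrix
    returns: (..., 2 * num_feats) sin/cos features
    """
    coords = 2.0 * coords - 1.0          # shift to [-1, 1]
    proj = coords @ gaussian_matrix      # random projection
    proj = 2.0 * np.pi * proj
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)
```

The question is whether these features, once computed, actually enter the attention layers.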
Hi, you are right; we did not use the positional information. The original SAM uses this Fourier feature mapping to make the point embedding similar to the image embedding. However, we directly interpolate from the image embedding to obtain the point embedding. This avoids the over-smoothing caused by a large number of tokens and also lets the prompt embedding focus more on semantic information. We tested adding positional encoding in our framework but observed no improvement.
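Interpolating a point embedding directly from the image embedding can be sketched as follows; this is a hedged illustration of the idea described above, not the repo's actual implementation (function name, shapes, and the assumed input size are mine):

```python
import torch
import torch.nn.functional as F

def point_embedding_from_image(image_embedding, points, img_size):
    """Sample point embeddings by bilinear interpolation of the image embedding.

    image_embedding: (B, C, H, W) feature map
    points: (B, N, 2) pixel coordinates (x, y) in the input image
    img_size: side length of the input image in pixels
    returns: (B, N, C) one embedding vector per point
    """
    # Normalize pixel coordinates to [-1, 1], as grid_sample expects.
    grid = points / img_size * 2.0 - 1.0   # (B, N, 2)
    grid = grid.unsqueeze(2)               # (B, N, 1, 2)
    sampled = F.grid_sample(image_embedding, grid, align_corners=False)
    return sampled.squeeze(-1).transpose(1, 2)  # (B, C, N) -> (B, N, C)
```

Because the embedding is read off the feature map at the point's location, position is carried implicitly by where the sample is taken, rather than by an explicit Fourier encoding added to the tokens.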