shape of spatial-temporal tokens X

isyangshu / Surgformer

[MICCAI 2024] Surgformer: Surgical Transformer with Hierarchical Temporal Attention for Surgical Phase Recognition

shape of spatial-temporal tokens X #5

Open cascat0 opened 2 days ago

cascat0 commented 2 days ago

Hello. I'm a little confused about Section 2.1 of the paper.

The shape of the input frame volume V is T × C × H × W, where C is the number of channels.

But the shape of the spatial-temporal tokens X is T × K × C, so the last dimension is still written as C.

I wonder whether a different symbol (such as D) should be used here instead of C.

isyangshu commented 1 day ago

Well, sorry for the confusion. First, we get the frame volume V with size T × 3 × H × W, and then we use patch embedding to get the spatial-temporal tokens: T × H' × W' × D -> T × K × D. So it would be better to use D instead of C here. Thanks for your question.
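To make the shape bookkeeping in this reply concrete, here is a minimal sketch of how T × 3 × H × W maps to T × K × D. It assumes non-overlapping square patches, so H' = H / patch and W' = W / patch with K = H' · W'. The function name `token_shape` and the values `patch=16`, `D=768`, and the 16-frame 224 × 224 clip are illustrative assumptions, not taken from the paper or the repository.

```python
def token_shape(T, H, W, patch=16, D=768):
    """Return the token shape (T, K, D) for a clip of T frames of size H x W.

    Assumes non-overlapping square patches; `patch` and `D` are
    hypothetical defaults chosen for illustration only.
    """
    if H % patch or W % patch:
        raise ValueError("H and W must be divisible by the patch size")
    h_p, w_p = H // patch, W // patch  # H' and W': patches per column and row
    K = h_p * w_p                      # spatial tokens per frame
    return (T, K, D)


# Example frame volume V with shape T x 3 x H x W (values are assumptions).
T, C, H, W = 16, 3, 224, 224
print(token_shape(T, H, W))  # (16, 196, 768)
```

The channel count C = 3 drops out of the result: the patch embedding projects each 3 × patch × patch block to a D-dimensional vector, which is why the last token dimension is D rather than C.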

cascat0 commented 1 day ago

Thanks for your answer. Temporal and spatial position embeddings are added in your code, but they were not mentioned in your paper. What is the benefit of adding those embeddings?