Open cascat0 opened 2 days ago
Sorry for the confusion. First, we get the frame volume V with size T × 3 × H × W, and then we apply patch embedding to get the spatial-temporal tokens: T × H' × W' × D -> T × K × D. So it would be better to use D instead of C here. Thanks for your question.
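For anyone following the shapes, here is a minimal sketch of the bookkeeping described above. It is not taken from the repository: the function name `token_shapes`, the non-overlapping patch size `patch`, and the example numbers (8 frames, 224 × 224 input, patch 16, D = 768) are assumptions made only for illustration, as is the relation H' = H / patch. The one relation implied by the reply itself is K = H' · W'.

```python
def token_shapes(T, H, W, patch, D, C=3):
    """Return the tensor shapes before and after patch embedding.

    Assumes non-overlapping square patches (hypothetical choice),
    so H' = H / patch and W' = W / patch.
    """
    if H % patch or W % patch:
        raise ValueError("H and W must be divisible by the patch size")
    Hp, Wp = H // patch, W // patch  # H', W': patch grid per frame
    K = Hp * Wp                      # K: number of tokens per frame
    return {
        "frames": (T, C, H, W),          # input frame volume V
        "patch_grid": (T, Hp, Wp, D),    # after patch embedding
        "tokens": (T, K, D),             # flattened spatial-temporal tokens
    }


# Illustrative values only, not the settings used in the paper.
shapes = token_shapes(T=8, H=224, W=224, patch=16, D=768)
print(shapes["frames"])  # (8, 3, 224, 224)
print(shapes["tokens"])  # (8, 196, 768)
```

The input channel count (3) and the token width D are independent numbers, which is why using C for both was confusing.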
Thanks for your answer. Temporal and spatial position embeddings are added in your code, but they were not mentioned in your paper. What is the benefit of adding those embeddings?
Hello. I'm a little confused about Section 2.1 of the paper.
The shape of the input frame volume V is T × C × H × W. The number of channels is C.
But the shape of the spatial-temporal tokens X is T × K × C. The number of channels is still C.
I wonder if a different symbol (such as D) should be used here instead of C.