A little question about dimensions

Thank you for the great work. After reading the code, I have a small question about the comments, and I hope you can help. In the following code, the dimensions of visual_context are [B, N, C] and the dimensions of text_embeddings are [B, K, C]. Should the dimensions after context_decoder be [B, K, C]?
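To make my reasoning concrete, here is a minimal sketch of why I expect [B, K, C]. It is not the repository's code. It assumes that context_decoder behaves like a cross-attention decoder in which text_embeddings act as the queries and visual_context acts as the keys and values; that is my assumption, not something I have confirmed. The helpers cross_attention, shape, and random_batch are hypothetical names I made up for illustration, and the sketch leaves out learned projections, multiple heads, and normalization.

```python
import math
import random


def cross_attention(query, memory):
    """Single-head scaled dot-product cross-attention for one sample.

    query:  K rows of length C (stand-in for text_embeddings[b])
    memory: N rows of length C (stand-in for visual_context[b])
    Returns K rows of length C: one output row per query row.
    """
    C = len(query[0])
    out = []
    for q in query:
        # One score per memory row, so N scores for each query row.
        scores = [sum(qi * mi for qi, mi in zip(q, m)) / math.sqrt(C) for m in memory]
        # Numerically stable softmax over the N memory rows.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of memory rows gives a single row of length C.
        out.append([sum(w * m[c] for w, m in zip(weights, memory)) for c in range(C)])
    return out


def shape(batch):
    """Return (B, L, C) for a nested list laid out as [batch][row][channel]."""
    return (len(batch), len(batch[0]), len(batch[0][0]))


def random_batch(B, L, C, rng):
    """Build a [B, L, C] nested list of Gaussian samples."""
    return [[[rng.gauss(0, 1) for _ in range(C)] for _ in range(L)] for _ in range(B)]


rng = random.Random(0)
B, N, K, C = 2, 6, 3, 4  # arbitrary small sizes, chosen only for illustration
visual_context = random_batch(B, N, C, rng)   # [B, N, C]
text_embeddings = random_batch(B, K, C, rng)  # [B, K, C]

# Assumed roles: text is the query, visual context is the memory.
decoded = [cross_attention(t, v) for t, v in zip(text_embeddings, visual_context)]

print(shape(visual_context), shape(text_embeddings), shape(decoded))
# prints: (2, 6, 4) (2, 3, 4) (2, 3, 4)
```

Under that assumption, the output has one row per query row, so its second dimension follows K (from text_embeddings) rather than N (from visual_context).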

For convenience, I have pasted the context_decoder forward code below.

Thanks again, and I look forward to your reply.

raoyongming / DenseCLIP

A little question about dimensions #54