runnanchen / CLIP2Scene

Question about Semantic-guided Spatial-temporal Consistency Regularization #10

Open fang196 opened 1 year ago

fang196 commented 1 year ago

Thanks for the great work! I have three questions about Semantic-guided Spatial-temporal Consistency Regularization.

  1. What is the reason for dividing the complete stitched point cloud into regular grids rather than using short-term temporality directly?
  2. What does the symbol * represent in Equation 3? Does it indicate a cross product operation?
  3. It is stated that the image is matched to the first frame of the point cloud $P_1$ using pixel-point correspondences $\{\hat{x}_i^1, \hat{p}_i^1\}_{i=1}^{\hat{M}}$. This implies that for values of $k$ ranging from 1 to $K$, we have $t_{\hat{i}}^k = t_{\hat{i}}^1$ and $\hat{x}_{\hat{i}}^k = \hat{x}_{\hat{i}}^1$. However, in Equation 4, the text embeddings are denoted as $t_{\hat{i}}^1$, while the image embeddings are denoted as $\hat{x}_{\hat{i}}^{\hat{k}}$. Why is this the case?