Question about Semantic-guided Spatial-temporal Consistency Regularization

Thanks for the great work! I have three questions about Semantic-guided Spatial-temporal Consistency Regularization.

1. What is the reason for dividing the complete stitched point cloud into regular grids rather than using short-term temporality directly?
2. What does the symbol $*$ represent in Equation 3? Does it indicate a cross product operation?
3. It is stated that the image is matched to the first frame of the point cloud $P_1$ using pixel-point correspondences $\{\hat{x}_i^1, \hat{p}_i^1\}_{i=1}^{\hat{M}}$. This implies that for values of $k$ ranging from 1 to $K$, we have $t_{\hat{i}}^k = t_{\hat{i}}^1$ and $\hat{x}_{\hat{i}}^k = \hat{x}_{\hat{i}}^1$. However, in Equation 4, the text embeddings are denoted as $t_{\hat{i}}^1$, while the image embeddings are denoted as $\hat{x}_{\hat{i}}^{\hat{k}}$. Why is this the case?

runnanchen / CLIP2Scene