Thanks a lot for your work. While reading the source code, I noticed that farthest point sampling is applied directly to the visual feature tokens. Why compute the distances on visual features rather than on the point cloud coordinates? Is there any benefit?
Hi,
We haven't ablated this choice recently. Our idea was that background features would be similar to each other, so FPS in feature space would keep only a small portion of them.
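To make that intuition concrete, here is a minimal toy sketch in NumPy. It is not the code from the repository: the function name `farthest_point_sampling`, the synthetic data, and all sizes are assumptions chosen only for illustration. Background tokens are spread out in space but have near-identical features, while foreground tokens sit close together in space but have distinct features.

```python
import numpy as np

def farthest_point_sampling(x, k, start=0):
    """Greedy FPS over the rows of x (shape (N, D)); returns k row indices."""
    selected = np.empty(k, dtype=np.int64)
    selected[0] = start
    # min_dist[i] = distance from row i to its nearest already-selected row
    min_dist = np.linalg.norm(x - x[start], axis=1)
    for j in range(1, k):
        nxt = int(np.argmax(min_dist))  # row farthest from the selected set
        selected[j] = nxt
        min_dist = np.minimum(min_dist, np.linalg.norm(x - x[nxt], axis=1))
    return selected

rng = np.random.default_rng(0)

# Hypothetical toy scene: 64 background tokens on a 4x4x4 grid in a 10-unit cube,
# all with near-identical features (small noise around zero).
grid = np.linspace(0.0, 10.0, 4)
bg_xyz = np.stack(np.meshgrid(grid, grid, grid, indexing="ij"), axis=-1).reshape(-1, 3)
bg_feat = 0.01 * rng.standard_normal((64, 10))

# 10 foreground tokens packed near the cube centre, each with a distinct feature.
fg_xyz = 5.0 + 0.1 * rng.uniform(-1.0, 1.0, size=(10, 3))
fg_feat = 10.0 * np.eye(10)

xyz = np.concatenate([bg_xyz, fg_xyz])     # (74, 3)
feat = np.concatenate([bg_feat, fg_feat])  # (74, 10)
is_fg = np.arange(74) >= 64                # True for foreground rows

k = 11
idx_feat = farthest_point_sampling(feat, k)  # distances in feature space
idx_xyz = farthest_point_sampling(xyz, k)    # distances in coordinate space

fg_kept_feat = int(is_fg[idx_feat].sum())
fg_kept_xyz = int(is_fg[idx_xyz].sum())
print("foreground tokens kept, FPS on features:", fg_kept_feat)
print("foreground tokens kept, FPS on coordinates:", fg_kept_xyz)
```

In this toy setup, FPS on features spends almost all of its budget on the foreground tokens, whereas FPS on coordinates spreads its samples across the spatially scattered background. Whether the same holds on real scenes depends on how similar the background features actually are, which is exactly what an ablation would need to check.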