minghanqin / LangSplat

Official implementation of the paper "LangSplat: 3D Language Gaussian Splatting" [CVPR2024 Highlight]
https://langsplat.github.io/

Can I really query the 3D Gaussians? Or can I only query the rendered images? #6

Closed jeezrick closed 5 months ago

jeezrick commented 8 months ago

I have a lingering question that has been on my mind, and I was hoping you could help clarify it for me.

The focal point of the paper is "3D Scene Querying," but upon reading it, I find myself pondering whether it is feasible to query a set of 3D Gaussians.

To elaborate, let's consider a scenario where I have five million trained Gaussians representing an unfamiliar scene. My objective is to locate the position of a 'TV.'

Can I use the term 'TV' to determine the 3D spatial coordinates of the TV from this set of 3D Gaussians and retrieve a corresponding image (i.e., determine an appropriate camera position for rendering)? How can you query a set of probabilistic distributions with a text prompt?

Alternatively, is my only option to query the rendered egocentric 2D image? If the TV is not present in the image, does that imply there is no way for me to ascertain where the TV is?

I appreciate your expertise and insights into this matter.

xxlbigbrother commented 6 months ago

@jeezrick Hi, I have the same question about how to query the 3D Gaussians. Have you solved it, or do you have any other insights? Could you share them? Thank you!

Li-Wanhua commented 6 months ago

Thank you for your attention to our work. To achieve 3D text querying, there are two possible approaches. The first, as you mentioned, directly computes the similarity between the 3D Gaussian points and the text query. The second first renders the 3D language Gaussians onto a 2D image plane using Gaussian Splatting, and then computes the similarity between the text query and the language features at the 2D image pixels.

Previous SOTA works like LERF adopted the second method because NeRF's implicit modeling prevents the use of the first. To ensure a fair comparison, we also employed the second method. However, our approach can indeed be tested with the first method, and we will explore it in the future to see whether it yields better performance. I believe the first method should be feasible, and I have seen some 3D-GS papers that use this idea for scene editing.

I hope this explanation addresses your questions.
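
For readers following along, here is a minimal sketch of what the first method could look like: compute the cosine similarity between a CLIP text embedding and per-Gaussian language features, then keep the best-matching Gaussian centers. This is not code from the LangSplat repository; the tensors `gaussian_xyz` and `gaussian_lang_feat`, the function name, and the assumption that the per-Gaussian features have already been decoded back to CLIP space by LangSplat's autoencoder are all hypothetical, and the OpenCLIP model/weights shown may differ from what the authors use.

```python
# Sketch of "method 1": query the 3D Gaussians directly with a text prompt.
# Assumptions (not from the LangSplat codebase):
#   gaussian_xyz        (N, 3) tensor of Gaussian centers
#   gaussian_lang_feat  (N, D) per-Gaussian language features in CLIP space
import torch
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text encoder; the exact architecture/pretrained tag is an assumption.
model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="laion2b_s34b_b88k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model = model.to(device).eval()

@torch.no_grad()
def query_gaussians(prompt, gaussian_xyz, gaussian_lang_feat, top_k=500):
    """Return centers and scores of the Gaussians most relevant to `prompt`."""
    tokens = tokenizer([prompt]).to(device)
    text_feat = model.encode_text(tokens)                      # (1, D)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    feat = gaussian_lang_feat.to(device)
    feat = feat / feat.norm(dim=-1, keepdim=True)

    sim = (feat @ text_feat.T).squeeze(-1)                     # (N,) cosine similarity
    idx = sim.topk(top_k).indices                              # most relevant Gaussians
    return gaussian_xyz[idx.cpu()], sim[idx].cpu()

# Usage idea: cluster or average the returned centers to get a 3D location for "TV",
# then place a camera looking at that point and render the region.
# xyz, scores = query_gaussians("TV", gaussian_xyz, gaussian_lang_feat)
```

A simple top-k threshold is used here only for illustration; in practice one would likely need a relevancy score against canonical/negative phrases (as in LERF/LangSplat's 2D evaluation) and some spatial clustering to turn the selected Gaussians into a single object location.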