vobecant / POP3D

Source code for NeurIPS paper "POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images"
https://vobecant.github.io/POP3D/

Regarding the question on feature space #8

Open KarlLi5 opened 4 months ago

KarlLi5 commented 4 months ago

Dear author, I am new to this field and have a detailed question about the methodology. Works like SCLIP, which achieve zero-shot open-vocabulary segmentation through CLIP, generally apply PAMR as post-processing to the predictions in order to obtain visually clean segmentation results. In your work, by distilling the feature space of the MaskCLIP+ image encoder into a voxel space and then taking the inner product directly with the text embeddings, you are able to obtain significant prediction results. Could you explain what in the optimization process leads to this outcome?
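For context, the zero-shot assignment step the question refers to can be sketched as follows. This is a minimal illustration, not the actual POP-3D code: the shapes, names, and random features are assumptions; the only point is that per-voxel class labels come from an inner product between L2-normalized voxel features and CLIP-style text embeddings.

```python
import numpy as np

# Hypothetical setup (not from the POP-3D repo): V voxel features and
# C class text embeddings, both living in the same D-dimensional
# vision-language feature space.
rng = np.random.default_rng(0)
V, C, D = 5, 3, 8
voxel_feats = rng.normal(size=(V, D))   # learned 3D voxel features
text_embeds = rng.normal(size=(C, D))   # text embeddings of class prompts

def open_vocab_assign(voxel_feats, text_embeds):
    # L2-normalize both sides so the inner product equals cosine similarity.
    f = voxel_feats / np.linalg.norm(voxel_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = f @ t.T                 # (V, C) similarity scores
    return logits.argmax(axis=1)     # per-voxel class index

labels = open_vocab_assign(voxel_feats, text_embeds)
print(labels.shape)
```

With real features, the quality of these labels depends entirely on how well the distilled voxel features align with the text embedding space, which is what the question below probes.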

vobecant commented 4 months ago

Dear @KarlLi5, I am not sure that I completely understand your question. What exactly do you mean by "this outcome"?

KarlLi5 commented 4 months ago

I apologize, there might have been a misunderstanding in my explanation. What I meant is that without PAMR mask refinement, the 2D segmentation results obtained with CLIP alone would be poor. If those feature spaces were projected into 3D, the voxel-level results should, in theory, not be impressive either. However, looking at the visualizations from POP3D, the predictions for common classes in the dataset are notable.

vobecant commented 4 months ago

Hi, I think that the main reason would be the quality of the MaskCLIP+ features, don't you agree?

KarlLi5 commented 4 months ago

Due to the limitations of my hardware, I have been unable to replicate your work, which is why I have these questions about the details. Thank you for your reply!