KarlLi5 opened 4 months ago
Dear @KarlLi5, I am not sure that I completely understand your question. What exactly do you mean by "this outcome"?
I apologize for the confusion in my explanation. What I meant is that without PAMR refinement of the masks, the 2D segmentation results obtained with CLIP are poor. If those feature spaces were projected into 3D, the voxel predictions should, in theory, not be impressive either. Yet the visualizations from POP3D show notable prediction results for the common classes in the dataset.
Hi, I think that the main reason would be the quality of the MaskCLIP+ features, don't you agree?
Due to the limitations of my hardware, I am unable to replicate your work myself, so I have some questions about the details. Thank you for your reply!
Dear author, I am new to this field and have a detailed question about the methodology. Works such as SCLIP that achieve zero-shot open-vocabulary segmentation through CLIP generally apply PAMR as post-processing on the predictions to obtain visually clean segmentation results. In your work, by distilling the feature space of the MaskCLIP+ image encoder into a voxel space and then taking the inner product directly with the text embeddings, you obtain significant prediction results. Could you explain what in the optimization process leads to this outcome?
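For context, the inner-product step I am asking about can be sketched as follows. This is only a minimal illustration of CLIP-style zero-shot classification over distilled voxel features, not the actual POP3D implementation; all shapes, names, and the random features are my own assumptions.

```python
import numpy as np

# Assumed shapes: N voxels carrying D-dim features distilled from the
# image encoder (e.g., MaskCLIP+), and C class prompts embedded by the
# CLIP text encoder. Random data stands in for real features here.
N, D, C = 1000, 512, 20
rng = np.random.default_rng(0)

voxel_feats = rng.normal(size=(N, D))   # distilled 3D voxel features
text_embeds = rng.normal(size=(C, D))   # one CLIP text embedding per class

# L2-normalize both sides so the inner product equals cosine similarity,
# matching CLIP's usual zero-shot classification recipe.
voxel_feats /= np.linalg.norm(voxel_feats, axis=1, keepdims=True)
text_embeds /= np.linalg.norm(text_embeds, axis=1, keepdims=True)

# Similarity logits and per-voxel class prediction via argmax.
logits = voxel_feats @ text_embeds.T    # shape (N, C)
pred = logits.argmax(axis=1)            # predicted class index per voxel
print(pred.shape)                       # (1000,)
```

My question is what in the training objective makes these raw similarities clean enough that no PAMR-style refinement is needed.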