shiyoung77 / OVIR-3D

This is the official repository for OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data. (CoRL'23)

Question about the qualitative results on ScanNet200 dataset #4

Closed ngoductuanlhp closed 11 months ago

ngoductuanlhp commented 11 months ago

Hi @shiyoung77,

Thank you for your amazing work and for publishing the code. I've encountered an issue while using your method on ScanNet200. Specifically, when visualizing all the generated instances in a scene (scene0011_00 in the validation split), the result appears as follows (see attached image). I've observed that many objects seem to be fragmented into multiple parts, which is considerably worse than in your paper. I'm wondering if something might be missing in the implementation.

Thank you and looking forward to your reply.

PhucNDA commented 11 months ago

Hi @shiyoung77, I also encountered a similar problem. The objects seem to be fragmented into parts...

shiyoung77 commented 11 months ago

I'm testing it on my end. Something seems to be wrong. I'll get back to you soon.

shiyoung77 commented 11 months ago

There was indeed a bug in the filtering process, which I have just fixed. Please pull the repo again, and sorry for the inconvenience. The results should be much more reasonable now, though not perfect (our overall instance retrieval mAP is ~0.21; I believe there is plenty of room for improvement). I reran ScanNet200, and you can download the result files directly from here.

I share some examples from the first few scenes below; you should be able to reproduce them with the latest code (both Detic and OVIR-3D). The "table" in "scene0011_00" is a failure case, but the segmentation of other objects, such as chairs and fridges, is much improved. One thing to note is that OVIR-3D allows multiple labels per point in the open-vocabulary setup. For example, part of a "chair" could also belong to a "cushion". Because of this, an object (e.g. a chair) may sometimes appear over-segmented, but when you query "chair", you can see that it is retrieved as a whole. A rough sketch of what happens at query time is below.
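
To make the multi-label behavior concrete, here is a rough sketch of what a single text query does at retrieval time. This is not the exact repo API: the CLIP model choice (ViT-B/32), the `query_instances` helper, and the assumption that per-instance features are already fused and CLIP-aligned are all illustrative.

```python
import numpy as np
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def query_instances(query: str, instance_features: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Rank fused 3D instances by cosine similarity to a text query.

    instance_features: (N, D) array, one CLIP-aligned feature per fused instance
                       (assumed to be given; D must match the text encoder).
    Returns the indices of the top_k best-matching instances, best first.
    """
    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize([query]).to(device))
    text_feat = text_feat[0].float().cpu().numpy()
    text_feat /= np.linalg.norm(text_feat)
    inst = instance_features / np.linalg.norm(instance_features, axis=1, keepdims=True)
    sims = inst @ text_feat              # cosine similarity per instance
    # Note: the same 3D point may belong to several retrieved instances
    # (e.g. both a "chair" instance and a "cushion" instance).
    return np.argsort(-sims)[:top_k]
```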

Let me know if there are other issues. Thanks for your interest!

example

PhucNDA commented 11 months ago

Thanks for the clarification. Have you tested OVIR-3D on the full set of ScanNet200, including all 198 classes?

shiyoung77 commented 11 months ago

@PhucNDA You mean the training set? No, I didn't. It would take quite some time to run given that it's 4x larger than the validation set. But feel free to give it a try. I think the quality would be similar. Nevertheless, I do plan to test on the new ScanNet++ dataset when I have time, which contains 280 (230 + 50) high-quality scans and better annotations.

PhucNDA commented 11 months ago

I meant the validation set (312 scenes) with all 198 classes.

shiyoung77 commented 11 months ago

Yes, I tested on the whole validation set (312 scenes). You can find the result file for each scan here. If I remember correctly, the exact number of categories in the validation set is slightly fewer than 198 (some categories only appear in the training set).

ngoductuanlhp commented 11 months ago

Thank you for sharing this. Could you provide the script to evaluate the results on the ScanNet200 benchmark (as shown in Tables 1 and 2 of your paper)?

shiyoung77 commented 11 months ago

Sure, I'll push the evaluation scripts very soon (hopefully by tomorrow). Stay tuned!

shiyoung77 commented 11 months ago

The evaluation script for ScanNet200 has been pushed. You may see a tiny difference (<0.01 in overall mAP) between the reproduced results and the table in the paper.

Just to remind you of two things that we explicitly wrote in the paper (Sections 5.1 & 5.2):

1. Only the annotated object categories in a 3D scene are used as text queries for evaluation. The retrieval mAP is computed for each 3D scene and then averaged over the whole dataset.
2. The uncountable categories "floor", "wall", and "ceiling", and their subcategories, are not evaluated.

If you believe this is not the correct way of evaluation, feel free to use your own metric. Let me know if you still have questions.

PhucNDA commented 11 months ago

Thank you

PhucNDA commented 11 months ago

Hi @shiyoung77, I see that your method uses a different evaluation protocol compared to other OV methods. "Only annotated object categories in a 3D scene are used as text queries for evaluation" seems strange. Have you tested it on the full set of text categories in ScanNet200? For example, in evalscannet200, on line 157, to decide the label for each predicted mask, instead of computing similarities between pred_features and the cat_features of the GT classes in each scene, as your method does:

    cat_feature = cat_feature in scene.....
    similarities = pred_features @ cat_feature

you would compute similarities between pred_features and the full set of global cat_features:

    cat_feature = cat_feature in VALID_CLASS_NAME  # your valid set
    similarities = pred_features @ cat_feature

Will there be any performance drop? Looking forward to your response. Thank you.

shiyoung77 commented 11 months ago

Well, I guess we have different definitions of this task.

I view this as an information retrieval problem: given a language query, retrieve relevant documents (ranked instances) from a database (a scene). That's why I call it instance retrieval. We are using the standard mAP metric from the textbook (also ref). There is no need to decide a label for each predicted mask because the label is your query, and our method returns the relevant instances for that query (label). You can query anything you want in a scene (for example, all val categories), but you can only evaluate the things that have ground-truth annotations. The AP for each category is computed independently and will be the same no matter how many categories you evaluate. A minimal sketch of this retrieval-style AP is below.
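
For reference, here is a minimal sketch of the retrieval-style AP I mean (textbook IR-style average precision). It is an illustration, not the repo's exact evaluation code; the function name and the IoU-matching convention in the docstring are assumptions.

```python
import numpy as np

def average_precision(relevance: np.ndarray, num_gt: int) -> float:
    """AP for one text query in one scene.

    relevance: binary array over the ranked retrieved instances, 1 if the instance
               matches a previously unmatched ground-truth instance
               (e.g. at IoU >= 0.5), 0 otherwise.
    num_gt:    number of ground-truth instances for the queried category.
    """
    if num_gt == 0:
        return float("nan")  # category not annotated in this scene: skipped
    hits = np.cumsum(relevance)
    precision_at_k = hits / (np.arange(len(relevance)) + 1)
    return float((precision_at_k * relevance).sum() / num_gt)

# Scene mAP = mean AP over the categories annotated in that scene;
# dataset mAP = mean over scenes, as described in Sections 5.1 & 5.2.
```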

You view this as a closed-set segmentation problem, i.e. detecting and segmenting a predefined set of categories in a scene. That's why you need to decide a label for each predicted mask, likely using softmax. However, there are some caveats: 1) our method does not know the ScanNet categories during the fusion process and can find instances that do not belong to the 200 ScanNet categories (e.g. a lampshade); 2) because of the softmax, the AP for each category depends on the total set of categories used for evaluation: the more categories you include, the lower the AP. These are not problems for closed-set detection, where the categories are fixed and known in advance, but I would say it is somewhat problematic in the open-vocabulary setup. A sketch of this closed-set labeling is below for comparison.
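
For comparison, a sketch of the closed-set alternative you describe, where each predicted mask is forced to take a single label via softmax over all query categories. The function name and temperature value are just for illustration; the point is that the max-softmax confidence (and therefore any score-dependent AP) changes with the number of categories C included.

```python
import numpy as np

def assign_labels(pred_features: np.ndarray, cat_features: np.ndarray,
                  temperature: float = 0.01):
    """pred_features: (N, D) mask features; cat_features: (C, D) text features.

    Returns one label index and one confidence per mask. The confidence
    depends on C, the total number of categories in the softmax.
    """
    logits = pred_features @ cat_features.T / temperature
    logits -= logits.max(axis=1, keepdims=True)                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs.max(axis=1)              # label, confidence
```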

I personally believe that the evaluation process should be different for closed-set and open-set problems; however, the best metric remains up for debate. Feel free to use your favorite metric given your task definition, and state it clearly in your paper. You have my result files, so it should be easy to do.