nickgkan / butd_detr

Code for the ECCV22 paper "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds"

About the arguments of butd, butd_cls, butd_gt #11

Closed Hiusam closed 1 year ago

Hiusam commented 1 year ago

Hi, I want to make sure I understand some of the arguments correctly.

args.butd: use the box stream or not. True: use the box stream
args.butd_cls == True: use the gt boxes provided by ReferIt3D but no class label. This corresponds to the "GT" setting in the paper.
args.butd_gt == True: use the gt boxes with class labels. The result is not included in the paper.
args.butd_cls == False and args.butd_gt == False: use the gt boxes obtained by the Group Free detector (without class label? I am not sure :( ). This corresponds to the "DET" setting in the paper.

Am I correct?

I have another question, about this code in Joint3DDataset:

        if self.butd_gt:
            all_detected_bboxes = all_bboxes
            all_detected_bbox_label_mask = all_bbox_label_mask
            detected_class_ids = class_ids

        # Assume a perfect object proposal stage
        if self.butd_cls:
            all_detected_bboxes = all_bboxes
            all_detected_bbox_label_mask = all_bbox_label_mask
            detected_class_ids = np.zeros((len(all_bboxes,)))
            classes = np.array(self.cls_results[anno['scan_id']])
            # detected_class_ids[all_bbox_label_mask] = classes[classes > -1]
            classes[classes == -1] = 325  # 'object' class
            _k = all_bbox_label_mask.sum()
            detected_class_ids[:_k] = classes[:_k]

What is the difference between classes and class_ids? I find that most of their valid labels are the same, but some differ. It seems that class labels are used in both the self.butd_cls and self.butd_gt settings?

I find it difficult to figure out the setting here.

Thanks!

ayushjain1144 commented 1 year ago

Hi, yes, your understanding is correct. And yes, we also use the classes from the Group-Free detector in the DET setting. In butd_cls, the boxes are ground truth and the classes come from a PointNet++ classifier (as in prior works, e.g., ReferIt3D).

class_ids are the ground-truth labels, while classes is loaded from a JSON file in which we stored the predictions of the PointNet++ classifier.
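
As a rough illustration of the butd_cls branch quoted earlier in this thread, here is a minimal, self-contained sketch on toy data. The function name `fill_predicted_class_ids`, the array sizes, and all label values are made up for the example; only the `-1` to `325` ('object') remapping and the "fill the first `_k` slots" logic are taken from the snippet. It is not the repository's code, just a way to see why classes and class_ids can disagree on some boxes.

```python
import numpy as np

OBJECT_CLASS_ID = 325  # generic 'object' class, as in the quoted snippet


def fill_predicted_class_ids(num_slots, label_mask, predicted):
    """Toy version of the butd_cls branch for one scene (hypothetical helper).

    num_slots:  total number of padded box slots
    label_mask: boolean array, True for slots that hold a real box
    predicted:  classifier outputs, where -1 means "no prediction"
    """
    predicted = np.array(predicted, dtype=np.int64)  # copy, so the input is untouched
    predicted[predicted == -1] = OBJECT_CLASS_ID     # fall back to 'object'
    num_valid = int(np.asarray(label_mask).sum())
    detected = np.zeros(num_slots)
    # Like the quoted code, this assumes the real boxes occupy the first slots.
    detected[:num_valid] = predicted[:num_valid]
    return detected


# Made-up scene: 5 box slots, the first 3 hold real boxes.
label_mask = np.array([True, True, True, False, False])
gt_class_ids = np.array([12, 40, 7, 0, 0])  # stand-in for class_ids (ground truth)
predicted = [12, -1, 9]                     # stand-in for the stored classifier output

detected = fill_predicted_class_ids(len(label_mask), label_mask, predicted)
# Fraction of real boxes where the predicted label equals the ground-truth label.
agreement = (detected[label_mask] == gt_class_ids[label_mask]).mean()
```

In this toy scene only one of the three real boxes keeps its ground-truth label, which mirrors the observation that most, but not all, valid labels match between the two sources.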

Let us know if something is still unclear!

Hiusam commented 1 year ago

Hi thank you for your quick reply. In your paper

All existing models ... They all use ground-truth 3D object boxes (without category labels) as the set of boxes to select from. We thus consider two evaluation setups

So the results in Table 1 are obtained with category labels?

ayushjain1144 commented 1 year ago

> Hi thank you for your quick reply. In your paper
>
> All existing models ... They all use ground-truth 3D object boxes (without category labels) as the set of boxes to select from. We thus consider two evaluation setups
>
> So the results in Table 1 are obtained with category labels?

Oh sorry for the confusion, in the paper we meant to say without assuming "ground-truth" category labels at test time.

Hiusam commented 1 year ago

> > Hi thank you for your quick reply. In your paper
> >
> > All existing models ... They all use ground-truth 3D object boxes (without category labels) as the set of boxes to select from. We thus consider two evaluation setups
> >
> > So the results in Table 1 are obtained with category labels?
>
> Oh sorry for the confusion, in the paper we meant to say without assuming "ground-truth" category labels at test time.

Sorry, I am still confused. It seems that in your code, test_dataset and train_dataset have the same configuration for using "classes" or "class_ids". What do you mean by test time?

Let's make it clear first: in your method, as long as you use box_stream, you use class labels of the boxes as well. The differences are:

1. args.butd_gt == True: use GT boxes and GT labels. 
2. args.butd_cls == True: use GT boxes and labels from PointNet++ classifier.
3. args.butd_gt == False and args.butd_cls == False: use boxes and labels from the Group-free detector.

So in your paper, GT setting corresponds to 1 or 2, and DET setting corresponds to 3, right?

ayushjain1144 commented 1 year ago

"in your method, as long as you use box_stream, you use class labels of the boxes as well" <- yes

"GT setting corresponds to 1 or 2" <- It corresponds to 2, i.e., GT boxes and labels from the PointNet++ classifier. We used "1" only for some early experiments.

Yes, as you said, training and testing have the same configuration. I see how "test-time" was confusing, and you can safely ignore it. I basically meant that prior works assume access to ground-truth boxes only, not category labels, when they do inference.
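
For reference, the three configurations confirmed above can be summarized in a small sketch. The helper `describe_box_stream` and the `BoxStreamConfig` tuple are hypothetical names introduced only for this illustration and do not exist in the repository. Checking butd_cls before butd_gt is an assumption based on the quoted dataset snippet, where the butd_cls branch runs after the butd_gt branch and would overwrite it if both flags were set.

```python
from typing import NamedTuple, Optional


class BoxStreamConfig(NamedTuple):
    boxes: str                    # where the box proposals come from
    labels: str                   # where the box class labels come from
    paper_setting: Optional[str]  # name of the setting in the paper, if reported


def describe_box_stream(butd, butd_gt=False, butd_cls=False):
    """Summarize the box-stream inputs implied by the flags (illustrative only)."""
    if not butd:
        return None  # box stream disabled
    if butd_cls:
        # Case 2: oracle boxes, predicted labels. This is the "GT" setting.
        return BoxStreamConfig("ground truth", "PointNet++ classifier", "GT")
    if butd_gt:
        # Case 1: oracle boxes and oracle labels, used only for early experiments.
        return BoxStreamConfig("ground truth", "ground truth", None)
    # Case 3: boxes and labels both come from the detector. This is "DET".
    return BoxStreamConfig("Group-Free detector", "Group-Free detector", "DET")


gt_setting = describe_box_stream(True, butd_cls=True)
det_setting = describe_box_stream(True)
```

Here `gt_setting.labels` is the PointNet++ classifier and `det_setting.boxes` is the Group-Free detector, matching cases 2 and 3 in the list above.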

Happy to clarify more if something is still unclear!

Hiusam commented 1 year ago

Great, thank you for your patience.

I wonder why you use GT boxes with labels from the PointNet++ classifier instead of GT labels. With GT boxes and GT labels, we could isolate the spatial reasoning ability of the model.

nickgkan commented 1 year ago

This is the standard protocol per the original ReferIt3D paper, i.e., assuming access to oracle box proposals but not oracle classes. Both what you're describing (corresponding to butd_gt in our code) and what ReferIt3D uses (butd_cls) are good for diagnosis, but in the real world there's no oracle detector or classifier, so the standard butd mode is the closest one to reality.

Hiusam commented 1 year ago

Yes, you are right. Thanks for answering.

Hiusam commented 1 year ago

> This is the standard protocol per the original ReferIt3D paper, i.e., assuming access to oracle box proposals but not oracle classes. Both what you're describing (corresponding to butd_gt in our code) and what ReferIt3D uses (butd_cls) are good for diagnosis, but in the real world there's no oracle detector or classifier, so the standard butd mode is the closest one to reality.

Hi, I just ran an experiment with oracle boxes and oracle class labels (butd_gt == True, SR3D). The accuracy is above 90%! Have you seen similar results? And can we conclude that 3D visual grounding can be easily solved with sufficiently good detectors? What's your opinion?

nickgkan commented 1 year ago

Yes, we have observed the same result on SR3D. This shows that object detection per se is a significant bottleneck for 3D language grounding. Our paper shows that you can improve the range of objects you can detect if you exploit the language cues. In the real world it's hard to assume a perfect detector, so being able to improve over it is important.

Other than that, it's possible that SR3D is not broad enough to draw conclusions about 3D grounding in general. We can say that for this dataset the major bottleneck is object detection.

WeitaiKang commented 7 months ago

Hi, I have a follow-up question about the butd_cls mode.

In Link, you get detected_class_ids from 'data/cls_results.json', which, as you mentioned in other issues, holds the class predictions from PointNet++ (pretrained on ScanNet).

However, in the standard mode, where butd_cls = False and butd_gt = False, your detected_class_ids come from Link. These are also the class predictions from PointNet++ (pretrained on ScanNet), but with an additional DC.nyu40id2class mapping.

I have checked both of the above detected_class_ids and found that they are different. Am I missing something here?

P.S. I removed the corrupt Link to avoid mistakes.