nickgkan / butd_detr

Code for the ECCV22 paper "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds"

Regarding Object Detection on ScanNet #20

Closed soham-joshi closed 1 year ago

soham-joshi commented 1 year ago

Hey @ayushjain1144 @nickgkan, I wanted to confirm something. For the ScanNet dataset (only), the training and testing task is object detection, and the target_ids are objects from the DC18 vocab. But according to src/joint_det_dataset.py line 761, only the first object is passed as the target every time. Also, the generated utterance (a concatenation of object names) can be irrelevant for objects that do not occur in DC18. For example, the scene 'scene0405_00' contains objects like 'trash can', while the utterance is 'cabinet . bed . chair . couch . table . door . window . bookshelf . picture . counter . desk . curtain . refrigerator . shower curtain . toilet . sink . bathtub . other furniture'. Could you confirm whether I am correct? I also have the following questions:

  1. Can we extend the object detection to detect two or more objects?
  2. Further, I wanted to run an object detection experiment on DC (instead of DC18). Could you give me a hint on how to do that?

Thanks!

soham-joshi commented 1 year ago

Another related query: with --joint_det and random_utt (datasets are scannet and sr3d), the algorithm creates a random utterance from sampled classes: 'smoke detector . alarm . kinect . shower curtain . printer . wall . crate . bowl . seat . door . garbage bag . cable . toilet seat cover dispenser . bathroom cabinet . light switch . soap dispenser . bar . floor . pen holder . toilet . not mentioned'. But the target_id and target_name passed in the ret_dict of getitem are 0 and 'shower curtain' respectively. The labels passed are also an array of zeros, since it is a random utterance. How does the random utterance help the model predict the target ('shower curtain' in this case)? Could you please briefly explain the logic behind random utterances?

Thank you!

soham-joshi commented 1 year ago

@nickgkan @ayushjain1144 for the first query, does choosing a random target help prevent the model from overfitting? Please correct me if I am wrong.

Thanks!

nickgkan commented 1 year ago

Hi, target_ids are only used by the grounding evaluators, not during training. If you look at the losses, we supervise all spans that correspond to objects that exist in the scene, i.e., we detect as many objects as possible, as we write in the paper.
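
To illustrate what "supervise all spans that correspond to objects that exist in the scene" could look like, here is a minimal sketch. It is not the repository's code: the function name `positive_spans`, the `" . "` separator handling, and the character-span format are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

SEPARATOR = " . "  # assumed separator, matching the utterances quoted in this thread


def positive_spans(utterance: str, scene_classes: List[str]) -> Dict[str, Tuple[int, int]]:
    """Map each prompt entry that exists in the scene to its (start, end) character span.

    Hypothetical helper: entries absent from the scene get no span, so they
    would receive no positive supervision.
    """
    present = set(scene_classes)
    spans: Dict[str, Tuple[int, int]] = {}
    cursor = 0
    for name in utterance.split(SEPARATOR):
        start, end = cursor, cursor + len(name)
        if name in present:
            spans[name] = (start, end)
        # Move past this name and the separator that follows it.
        cursor = end + len(SEPARATOR)
    return spans


utterance = "cabinet . bed . chair . couch . table"
scene = ["chair", "table", "trash can"]  # 'trash can' is not in the prompt
spans = positive_spans(utterance, scene)
for name, (start, end) in spans.items():
    print(name, (start, end), repr(utterance[start:end]))
```

In this sketch, 'trash can' never gets a span because it is not in the prompt, and 'cabinet', 'bed', and 'couch' get none because they are not in the scene. Every object that is both mentioned and present is a positive, regardless of which single target_id is stored for evaluation.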

If you want to use DC, you need to replace the calls to DC18 with DC and then change the evaluator, which may not be trivial, but you can give it a shot.

Random utterances are just detection prompts with random objects. We supervise the detection of all relevant mentioned objects while avoiding predictions for the rest.
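
As a rough sketch of that idea (again, not the repository's implementation), a random detection prompt can be built by sampling class names from a vocabulary and then splitting them into names that are present in the scene (to be detected) and names that are absent (to be left unmatched). The function `random_detection_prompt`, its parameters, and the toy vocabulary are hypothetical. The trailing 'not mentioned' entry mirrors the utterance quoted earlier in this thread.

```python
import random
from typing import List, Tuple


def random_detection_prompt(
    vocab: List[str],
    scene_classes: List[str],
    num_names: int,
    seed: int,
) -> Tuple[str, List[str], List[str]]:
    """Sample class names into a prompt and split them into positives and negatives.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    sampled = rng.sample(vocab, k=min(num_names, len(vocab)))
    present = set(scene_classes)
    # Positives: sampled names that actually occur in the scene.
    positives = [name for name in sampled if name in present]
    # Negatives: sampled names with no matching object; nothing should be predicted for them.
    negatives = [name for name in sampled if name not in present]
    utterance = " . ".join(sampled) + " . not mentioned"
    return utterance, positives, negatives


toy_vocab = ["alarm", "bowl", "door", "printer", "shower curtain", "toilet", "wall"]
toy_scene = ["shower curtain", "toilet", "door", "sink"]
utterance, positives, negatives = random_detection_prompt(toy_vocab, toy_scene, num_names=5, seed=0)
print(utterance)
print("detect:", positives)
print("ignore:", negatives)
```

Under these assumptions, the negatives are what make the prompt useful for training: the model has to decide for each mentioned name whether a matching object exists, instead of assuming that everything in the prompt is present.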

soham-joshi commented 1 year ago

Okay, got it, thanks @nickgkan for the response!