Closed (soham-joshi closed 1 year ago)
Another related query: with --joint_det and random_utt (datasets: ScanNet and SR3D), the algorithm creates a random utterance from sampled classes: 'smoke detector . alarm . kinect . shower curtain . printer . wall . crate . bowl . seat . door . garbage bag . cable . toilet seat cover dispenser . bathroom cabinet . light switch . soap dispenser . bar . floor . pen holder . toilet . not mentioned'. However, the target_id and target_name passed in the ret_dict of getitem are 0 and 'shower curtain', respectively. The labels passed are also an all-zero array, since this is a random utterance. How does the random utterance help the model predict the target ('shower curtain' in this case)? Could you please briefly explain the logic behind random utterances?
Thank you!
@nickgkan @ayushjain1144 For the first query: does choosing a random target help prevent the model from overfitting? Please correct me if I am wrong.
Thanks!
Hi, target_ids are only used by the grounding evaluators, not during training. If you look at the losses, we supervise all spans that correspond to objects that exist in the scene, i.e., we detect as many objects as possible, as we write in the paper.
If you want to use DC, you need to replace the calls to DC18 with DC and then change the evaluator, which may not be trivial, but you can give it a shot.
Random utterances are just detection prompts with random objects. We supervise the detection of all relevant mentioned objects while avoiding predictions for the rest.
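As a rough illustration only (this is not the code in src/joint_det_dataset.py; the function name, arguments, and the character-span representation below are hypothetical), a detection prompt of this kind could be assembled by sampling class names, joining them with ' . ', and marking as positive only the spans of classes that actually exist in the scene:

```python
import random


def build_detection_prompt(scene_classes, vocab, num_classes=5, seed=None):
    """Hypothetical sketch: sample class names, join them into a prompt,
    and return character spans for the classes present in the scene."""
    rng = random.Random(seed)
    sampled = rng.sample(vocab, k=min(num_classes, len(vocab)))
    # Same surface form as the utterances quoted in this thread.
    utterance = " . ".join(sampled) + " . not mentioned"

    present = set(scene_classes)
    positive_spans = {}
    cursor = 0
    for name in sampled:
        if name in present:
            # Only objects that exist in the scene get a supervised span.
            positive_spans[name] = (cursor, cursor + len(name))
        cursor += len(name) + len(" . ")
    return utterance, positive_spans


vocab = ["shower curtain", "toilet", "door", "printer", "bowl", "floor"]
scene = ["toilet", "door", "floor"]
utterance, spans = build_detection_prompt(scene, vocab, num_classes=4, seed=0)
for name, (start, end) in spans.items():
    print(name, "->", repr(utterance[start:end]))
```

Sampled classes that are absent from the scene receive no span in this sketch, which mirrors the idea that the model should not predict them.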
Okay, got it, thanks @nickgkan for the response!
Hey @ayushjain1144 @nickgkan, I wanted to confirm something: for the ScanNet dataset (only), the training and testing tasks are object detection, and the target_ids will be objects from the DC18 vocab. But according to src/joint_det_dataset.py line 761, only the first object is passed as the target every time. Sometimes the utterance created (a concatenation of object names) might be irrelevant for objects that do not occur in DC18 (for example, in the scene 'scene0405_00', objects like 'trash can' with an utterance like 'cabinet . bed . chair . couch . table . door . window . bookshelf . picture . counter . desk . curtain . refrigerator . shower curtain . toilet . sink . bathtub . other furniture'). Could you confirm whether I am correct? I also have the following queries: