vt-vl-lab / iCAN

[BMVC 2018] iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection
https://gaochen315.github.io/iCAN/
MIT License
259 stars 60 forks

About Spatial Jitter #29

Closed ASMIftekhar closed 5 years ago

ASMIftekhar commented 5 years ago

Thanks for the repo. In the supplementary material you mention augmenting the data by spatially jittering the ground truth. Can you give a bit more detail on this? For example, how do you generate the extra ground-truth samples? Also, you mention that the human and object stream losses are computed on the 16 positive triplets; what do you mean by that?

gaochen315 commented 5 years ago

Each GT H-O pair has an H_box and an O_box. We randomly jitter these bounding boxes (translating them and changing their scale/aspect ratio), while making sure the IoU between each augmented box and its GT box stays larger than a threshold. We then pair the augmented H_box and O_box and regard these pairs as positive training data as well.
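For concreteness, here is a minimal sketch (not the repo's actual code) of what such a jittering step could look like; the shift/scale ranges (±15%) and the IoU threshold (0.7) are illustrative placeholders, not necessarily the values used in the paper:

```python
import numpy as np

def bbox_iou(box_a, box_b):
    """IoU between two boxes given as [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-10)

def jitter_box(gt_box, img_w, img_h, iou_thresh=0.7, max_tries=50):
    """Randomly translate/rescale a GT box, keeping IoU with the GT above a threshold."""
    x1, y1, x2, y2 = gt_box
    w, h = x2 - x1, y2 - y1
    for _ in range(max_tries):
        # random shift up to +-15% of the box size and scale change up to +-15%
        dx = np.random.uniform(-0.15, 0.15) * w
        dy = np.random.uniform(-0.15, 0.15) * h
        sw = np.random.uniform(0.85, 1.15) * w
        sh = np.random.uniform(0.85, 1.15) * h
        cx, cy = x1 + w / 2 + dx, y1 + h / 2 + dy
        new_box = [max(0.0, cx - sw / 2), max(0.0, cy - sh / 2),
                   min(img_w, cx + sw / 2), min(img_h, cy + sh / 2)]
        if bbox_iou(gt_box, new_box) > iou_thresh:
            return new_box
    return list(gt_box)  # fall back to the original GT box if no valid jitter is found
```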

To train the human stream, at each iteration we focus on one H-O pair. This gives us one H_box. We can also borrow 15 additional augmented H_boxes. We train the human stream with these (1 + 15 = 16) H_boxes. Thus the batch size is 16 (even though all 16 boxes correspond to the same human).
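As a rough illustration (again just a sketch, with a hypothetical helper that reuses `jitter_box` from above), assembling that per-iteration batch of 16 positive H_boxes could look like:

```python
def build_human_stream_batch(gt_h_box, img_w, img_h, num_aug=15, iou_thresh=0.7):
    """Assemble the human-stream batch for one iteration:
    the GT H_box plus num_aug jittered copies (1 + 15 = 16 positives)."""
    boxes = [list(gt_h_box)]
    for _ in range(num_aug):
        boxes.append(jitter_box(gt_h_box, img_w, img_h, iou_thresh))
    return np.asarray(boxes, dtype=np.float32)  # shape (16, 4)
```

All 16 boxes share the same (positive) interaction labels, which is what the statement about computing the stream losses over the 16 positive triplets refers to.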

Please let me know if I explained the procedure clearly. Thanks.

ASMIftekhar commented 5 years ago

Thanks for your response. I have another question related to your answer: did you train the human stream with detected bounding boxes from Fast RCNN + augmented boxes, or with ground-truth boxes + augmented boxes?

gaochen315 commented 5 years ago

Ground-truth boxes + augmented boxes.

ASMIftekhar commented 5 years ago

Thank you.