Closed ASMIftekhar closed 5 years ago
Each GT H-O pair has an H_box and an O_box. We randomly jitter these bounding boxes, i.e. by translating them and changing the ratio, while making sure the IoU between each augmented box and its GT box is larger than a threshold. We then pair the augmented H_boxes and O_boxes, and regard these pairs as positive training data as well.
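For readers who want something concrete, here is a minimal sketch of that jittering step. It is not the repo's code: the function names, the `(x1, y1, x2, y2)` box format, the 0.7 IoU threshold, and the shift/scale ranges are all assumptions for illustration, and the boxes are not clipped to the image.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def jitter_box(gt_box, rng, iou_thresh=0.7, max_shift=0.1, max_scale=0.1, max_tries=100):
    """Randomly translate and rescale gt_box; keep a candidate only if its
    IoU with gt_box exceeds iou_thresh (all numeric defaults are assumed)."""
    x1, y1, x2, y2 = gt_box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2.0, y1 + h / 2.0
    for _ in range(max_tries):
        # Translation: shift the center by a fraction of the box size.
        new_cx = cx + rng.uniform(-max_shift, max_shift) * w
        new_cy = cy + rng.uniform(-max_shift, max_shift) * h
        # Ratio change: scale width and height independently.
        new_w = w * (1.0 + rng.uniform(-max_scale, max_scale))
        new_h = h * (1.0 + rng.uniform(-max_scale, max_scale))
        cand = (new_cx - new_w / 2.0, new_cy - new_h / 2.0,
                new_cx + new_w / 2.0, new_cy + new_h / 2.0)
        if iou(cand, gt_box) > iou_thresh:
            return tuple(float(v) for v in cand)
    # Fallback if no candidate passed: return the GT box itself.
    return tuple(float(v) for v in gt_box)

def augment_pair(h_box, o_box, num_aug, seed=0):
    """Return num_aug augmented (H_box, O_box) pairs for one GT H-O pair."""
    rng = np.random.default_rng(seed)
    return [(jitter_box(h_box, rng), jitter_box(o_box, rng)) for _ in range(num_aug)]
```

Calling `augment_pair((10, 20, 110, 220), (50, 60, 150, 160), 15)` would give 15 extra positive pairs for that one GT pair, each box overlapping its GT box above the threshold.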
To train the human stream, at each iteration we focus on one H-O pair. This gives us one H_box. We can also borrow 15 additional augmented H_boxes. We train the human stream with these (1+15=16) H_boxes. Thus the batch size is 16 (even though all 16 boxes are about the same H).
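As a rough illustration of how such a 16-box batch could be put together (again only a sketch, not the repo's code): `build_human_batch` is a hypothetical helper, the pool of augmented H_boxes is assumed to be precomputed, and sampling with replacement when the pool has fewer than 15 boxes is my own assumption.

```python
import numpy as np

def build_human_batch(gt_h_box, augmented_h_boxes, num_aug=15, seed=0):
    """Stack one GT H_box with num_aug augmented H_boxes into a (1 + num_aug, 4) array."""
    rng = np.random.default_rng(seed)
    pool = np.asarray(augmented_h_boxes, dtype=np.float32).reshape(-1, 4)
    if len(pool) == 0:
        raise ValueError("need at least one augmented H_box")
    # Assumption: sample with replacement only when the pool is too small.
    replace = len(pool) < num_aug
    idx = rng.choice(len(pool), size=num_aug, replace=replace)
    gt = np.asarray(gt_h_box, dtype=np.float32).reshape(1, 4)
    # Row 0 is the GT box; rows 1..num_aug are the borrowed augmented boxes.
    return np.concatenate([gt, pool[idx]], axis=0)
```

With the default `num_aug=15`, the result has shape `(16, 4)`, which matches the batch size of 16 described above.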
Please let me know if I explained the procedure clearly. Thanks.
Thanks for your response. I have another question related to your answer: did you train the human stream with detected bounding boxes from Fast RCNN + augmented boxes, or with ground truth boxes + augmented boxes?
Ground truth boxes+augmented boxes.
Thank you.
Thanks for the repo. In the supplemental material, you say that you augment the data by spatially jittering the ground truth. Can you give a bit more detail on this? For example, how do you generate the extra ground truth samples? Also, you mention that the human and object stream losses are calculated on the 16 positive triplets. What do you mean by that?