sfzhang15 / SFD

S³FD: Single Shot Scale-invariant Face Detector, ICCV, 2017

A clear explanation on matching strategy #32

Closed davidnvq closed 6 years ago

davidnvq commented 6 years ago

Thank you Shifeng for your great work, and big congratulations on your new CVPR 2018 and IJCAI work. I just looked around your website and I'm truly impressed by what you have done so far. I believe that you are, and will continue to be, an excellent scientist with a great impact on our society. Definitely!

Back to SFD: it has now been more than a year since your publication, but I still find it useful and one of the state-of-the-art methods for face detection. However, I'm somewhat confused by the matching strategy of SFD. I found your answer in a previous issue:

> We just directly use IoU threshold = 0.35 to choose them all. After step 1, for those GT whose matched anchors are fewer than 6, we use step 2 to choose its top-6 anchors to match.

As my understanding:

  1. One anchor_box can be assigned to only one ground-truth label (i.e., one face).
  2. After step 1, some faces might have 6 or more matched anchors, and some might have fewer than 6. But no two faces share the same anchor_box, because of fact #1. In step 1, each anchor_box is assigned to its best match (the face with the highest IoU).
  3. Now we move to step 2, where I am confused. To keep it simple, suppose we need to find N = 4. There is a list of faces that have fewer than 4 anchor boxes, `list_faces = [face1, face2, face3]`, and some anchor boxes `available_anchor_boxes = [box1, box2, box3, box4, box5]` that are still available (weren't assigned to any face in step 1). The numbers of matched anchor boxes for face1, face2, face3 are `[2, 3, 1]`, all less than topN (4). The IoU values are:

    |       | box1 | box2 | box3 | box4 | box5 |
    |-------|------|------|------|------|------|
    | face1 | 0.12 | 0.21 | 0.06 | 0.13 | 0.24 |
    | face2 | 0.24 | 0.08 | 0.23 | 0.10 | 0.34 |
    | face3 | 0.33 | 0.22 | 0.01 | 0.20 | 0.02 |

My question is: A. Do we have to find topN more anchor boxes for each of face1, face2, face3, so that the new numbers of anchor boxes become `[2 + 4, 3 + 4, 1 + 4]`; or do we just find topK = topN - (# of already matched anchor boxes), like `[2 + 2, 3 + 1, 1 + 3]`?

B. In the case of `[2 + 2, 3 + 1, 1 + 3]`, do we iterate list_faces sequentially from face1 -> face2 -> face3 to find the topK?

C. If B is correct, then for face1 we will find topK = 2 anchors, say `[box5, box2]`. This means we mark box5 and box2 as matched anchors (so they will not be assigned to any other face). However, the best match for box5 is face2, and for box2 it is face3. After assigning topK anchor boxes to face1, the available boxes are `[box1, box3, box4]`. With this method, face2 will be assigned box1; the available boxes are then `[box3, box4]`, and we might just assign box4 to face3, even though we need topK = 3 for it.

I know the probability of this overlapping case C happening is very low when we have a huge number of anchor boxes (say 33,125, not 5 boxes as in the example above). However, I just want to make sure whether your idea is implemented exactly like this or differently.

I hope the above example is simple and intuitive enough to point out my confusion and make your idea clearer to us. I would really appreciate it if you could review my understanding (1, 2, 3) and my questions (A, B, C). Many thanks and have a nice day.

sfzhang15 commented 6 years ago

@quangdtsc Hi, thanks.
A. Just find topK = topN - (# of already matched anchor boxes), like `[2 + 2, 3 + 1, 1 + 3]`.
B. Yes. Besides, we have a variable that marks the best-matching face for every anchor, i.e., every face has a list storing its candidate anchors, like `face1_anchor_list = []`, `face2_anchor_list = [box3, box5]`, `face3_anchor_list = [box1, box2, box4]`. So we iterate list_faces sequentially from face1 -> face2 -> face3 to find the topK within each face's anchor_list.
C. Please refer to B for the answer.

Besides, some faces do not have enough candidate anchors, e.g., face1 in the above example, so they will not match any more anchors in step 2.
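To make the confirmed strategy concrete, here is a minimal sketch (not the official implementation; `match_anchors`, `IOU_THRESHOLD`, and `TOP_N` are illustrative names, and the IoU values come from the toy example above). Step 2 only draws from each face's candidate list, i.e., the unassigned anchors whose best-matching face is that face:

```python
import numpy as np

IOU_THRESHOLD = 0.35  # step-1 matching threshold (as stated in the thread)
TOP_N = 4             # target anchors per face (the paper's step 2 uses 6)

def match_anchors(iou_matrix, top_n=TOP_N, thresh=IOU_THRESHOLD):
    """iou_matrix: (num_faces, num_anchors) array of IoU values.
    Returns assignment: anchor index -> face index, or -1 if unmatched."""
    num_faces, num_anchors = iou_matrix.shape
    assignment = np.full(num_anchors, -1, dtype=int)

    # Each anchor's best-matching face, used in both steps.
    best_face = iou_matrix.argmax(axis=0)
    best_iou = iou_matrix.max(axis=0)

    # Step 1: assign each anchor to its best face if IoU >= threshold.
    for a in range(num_anchors):
        if best_iou[a] >= thresh:
            assignment[a] = best_face[a]

    # Step 2: for faces with fewer than top_n matches, take the
    # top-(top_n - matched) unassigned anchors from that face's
    # candidate list (anchors whose best match is this face).
    for f in range(num_faces):
        matched = int((assignment == f).sum())
        if matched >= top_n:
            continue
        candidates = [a for a in range(num_anchors)
                      if assignment[a] == -1 and best_face[a] == f]
        candidates.sort(key=lambda a: iou_matrix[f, a], reverse=True)
        for a in candidates[:top_n - matched]:
            assignment[a] = f
    return assignment
```

On the toy IoU table, no anchor passes step 1 (all IoUs are below 0.35), and step 2 then gives face2 the anchors box3 and box5, face3 the anchors box1, box2, box4, and face1 nothing, matching the candidate lists in the answer above. This also resolves question C: because a face can only recruit anchors from its own candidate list, one face can never "steal" an anchor whose best match is another face.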

davidnvq commented 6 years ago

@sfzhang15 Your answer is quite clear to me now; thank you for your reply. Regarding training time: could you share roughly how long one full epoch takes with your hardware configuration? I found that the classification layer on conv3_3 slows training down quite a lot. One epoch takes me 1 hour and 10 minutes (quite slow) on an NVIDIA GTX 1080 8 GB with an 8-core CPU. Your answer would be a nice benchmark to confirm that my implementation is correct in terms of time complexity.

sfzhang15 commented 6 years ago

@quangdtsc For a 640x640 input image with batch size 32, we use 2 Titan X (Maxwell) GPUs to train our model. Every iteration takes about 4.5 s, so 120k iterations need about 6 days. As you said, the classification layer on conv3_3 slows the training down quite a lot.
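As a quick sanity check of the quoted numbers (taking "120k" at face value):

```python
# 120,000 iterations at ~4.5 s each is about 6.25 days of wall-clock time,
# consistent with the "about 6 days" figure quoted above.
iterations = 120_000
secs_per_iter = 4.5
days = iterations * secs_per_iter / 86_400  # 86,400 seconds per day
print(days)  # 6.25
```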