sgrvinod / a-PyTorch-Tutorial-to-Object-Detection

SSD: Single Shot MultiBox Detector | a PyTorch Tutorial to Object Detection

prior box not clipped as expected #94

Open Hope1337 opened 1 year ago

Hope1337 commented 1 year ago

[image: prior boxes generated by the repo's code, drawn on a sample image and extending beyond its boundary]

As you stated, prior boxes will be clipped if they fall outside the image, but in your code:

prior_boxes.clamp_(0, 1)  # (8732, 4)

prior_boxes is still in YOLO style (meaning [cx, cy, w, h]) at that point, but if we want to clip it, we would want something like this:

        prior_boxes = pascalVOC_style(prior_boxes)  # convert to (xmin, ymin, xmax, ymax)
        prior_boxes.clamp_(0, 1)                    # clip to the image
        prior_boxes = yolo_style(prior_boxes)       # convert back to (cx, cy, w, h)
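
For concreteness, here is a minimal runnable sketch of that idea. pascalVOC_style / yolo_style are just my placeholder names, so I write the conversions out explicitly (the repo's utils.py has equivalents, cxcy_to_xy / xy_to_cxcy, if I remember right):

    import torch

    def pascalVOC_style(cxcy):
        # (cx, cy, w, h) -> (xmin, ymin, xmax, ymax)
        return torch.cat([cxcy[:, :2] - cxcy[:, 2:] / 2,
                          cxcy[:, :2] + cxcy[:, 2:] / 2], dim=1)

    def yolo_style(xy):
        # (xmin, ymin, xmax, ymax) -> (cx, cy, w, h)
        return torch.cat([(xy[:, :2] + xy[:, 2:]) / 2,
                          xy[:, 2:] - xy[:, :2]], dim=1)

    prior_boxes = torch.tensor([[0.1, 0.1, 0.3, 0.3]])  # fractional center-size coords
    prior_boxes = yolo_style(pascalVOC_style(prior_boxes).clamp_(0, 1))
    print(prior_boxes)  # tensor([[0.1250, 0.1250, 0.2500, 0.2500]])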

Am I right?

AndreiMoraru123 commented 1 year ago

That's an interesting point, but I think this is covered. Notice how the center-encoded coordinates are also in percentages:

[image: excerpt from the tutorial showing the priors' center-size coordinates expressed as fractions of the image dimensions]

So clamping between 0 and 1 should work the same.

Hope1337 commented 1 year ago

@AndreiMoraru123 w and h are just the size of the prior box. What I mean is that the top-left and bottom-right corners of the rectangle have to be recomputed from (cx, cy, w, h), and those corners can be negative or greater than 1, so the box can still fall outside the image. Say cx, cy = (0.1, 0.1) but w, h = (0.3, 0.3): all four values are in the valid range, yet xmin = cx - w/2 = -0.05, so the prior box still sticks out. That's what I showed in the image above, using prior boxes from the author's code.

AndreiMoraru123 commented 1 year ago

I see your point. You might be onto something. Indeed, if you generate a prior box centered close enough to a corner of the image, you can easily overshoot the image with widths and heights that are themselves within bounds. For example, for a prior centered at 20% of a (500, 500) image, i.e. at (100, 100), any (width, height) greater than 40% of the image will overshoot, and clamp will do nothing about it:

cx, cy, w, h = 0.2, 0.2, 0.5, 0.5

[image: plot of this prior box overshooting the 500×500 image boundary]
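
A quick sanity check of the arithmetic (plain Python, my own numbers worked through):

    # Prior centered at 20% of a 500x500 image, with 50% width/height.
    cx, cy, w, h = 0.2, 0.2, 0.5, 0.5

    xmin, ymin = cx - w / 2, cy - h / 2  # -0.05, -0.05 -> outside the image
    xmax, ymax = cx + w / 2, cy + h / 2  #  0.45,  0.45

    # In pixels that is (-25, -25) to (225, 225), yet every one of
    # cx, cy, w, h is already inside [0, 1], so clamp_(0, 1) on the
    # center-size form changes nothing.
    print([v * 500 for v in (xmin, ymin, xmax, ymax)])  # [-25.0, -25.0, 225.0, 225.0]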

Maybe this is worth looking into.

Hope1337 commented 1 year ago

@AndreiMoraru123 Yup, as I stated, we can avoid it by first converting (cx, cy, w, h) to (xmin, ymin, xmax, ymax), clamping to [0, 1], and then converting back. But I wonder how the author achieved the same performance as the paper using these prior boxes. Or did the original Caffe repo do the same thing?

Hope1337 commented 1 year ago

@sgrvinod Sorry for bothering you, but I think this is a bug. Besides that, the pretrained-model link seems to be dead.

sgrvinod commented 1 year ago

Hi @NicolasHarding, I just looked at it, and I think you are right. Thanks for pointing this out. I was careless and the priors won't be clipped. Therefore, some of my priors are different from those in the paper.

I'm guessing I still get the same performance because there are enough priors that, even for objects at the edges of images (where the "clipped" sizes would matter), there's always a different prior that's close enough and/or the offset calculation is good enough.

I'm not completely sure how to proceed, because immediately modifying my code to clip those priors would affect use of the pretrained model, which was trained with the different priors. I'll probably make a note of this issue in the code and tutorial in the next couple of days.

About the pretrained model, I could access it and I know some others could as well. Could you check again? I just changed the permissions of the Google Drive folder above it as well. If it still doesn't work, let me know, I'll make it available elsewhere. @AndreiMoraru123, do you have the same problem?

AndreiMoraru123 commented 1 year ago

The link to the torch checkpoint works from Chrome. Google Drive is always iffy on other browsers for reasons, @NicolasHarding. Also, @sgrvinod, I know you made these tutorials years ago, but I'm now thinking Hugging Face would be the best place to upload models and get more exposure for your work.

sgrvinod commented 1 year ago

@AndreiMoraru123 Thanks for confirming! I'll check out Hugging Face; that idea never occurred to me. Thank you.

Hope1337 commented 1 year ago

@sgrvinod I sincerely apologize for bothering you multiple times, but it seems that this is not correct. In class MultiBoxLoss:

self.smooth_l1 = nn.L1Loss()

nn.L1Loss() is not actually smooth L1 loss.
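
For reference, PyTorch ships the paper's localization loss as nn.SmoothL1Loss, and the two diverge even on toy inputs (a quick check of my own, not code from the repo):

    import torch
    import torch.nn as nn

    pred = torch.tensor([0.0, 2.0])
    target = torch.tensor([0.5, 0.0])  # residuals of 0.5 and 2.0

    print(nn.L1Loss()(pred, target).item())        # mean(|x|) = 1.25
    print(nn.SmoothL1Loss()(pred, target).item())  # quadratic below beta=1 -> 0.8125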

sgrvinod commented 1 year ago

> @sgrvinod I sincerely apologize for bothering you multiple times, but it seems that this is not correct. In class MultiBoxLoss:
>
> self.smooth_l1 = nn.L1Loss()
>
> nn.L1Loss() is not actually smooth L1 loss.

Hi @AakiraOtok, it's not a bother at all. Yeah, I was aware of this. Thanks for reminding me. See #60. I'll make a note of this difference from the paper in the tutorial and code, but probably won't change the loss function now, as L1 loss appears to work just as well.

Hope1337 commented 1 year ago

> Hi @AakiraOtok, it's not a bother at all. Yeah, I was aware of this. Thanks for reminding me. See #60. I'll make a note of this difference from the paper in the tutorial and code, but probably won't change the loss function now, as L1 loss appears to work just as well.

Thanks for the fast reply. There's another thing I think you could add to your note: using L1 instead of the smooth version can cause exploding gradients (which happened to you), so you had to choose batch_size = 8. With batch_size = 32 and smooth L1, the loss starts around 24 and converges to ~2.5 (mean) after 120k iterations.
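
To illustrate the gradient point with a toy residual (my own quick check, not the repo's code):

    import torch
    import torch.nn as nn

    target = torch.zeros(1)

    for loss_fn in (nn.L1Loss(), nn.SmoothL1Loss()):
        x = torch.tensor([0.01], requires_grad=True)  # tiny localization error
        loss_fn(x, target).backward()
        print(type(loss_fn).__name__, x.grad.item())

    # L1Loss       1.0  -> gradient magnitude stays 1 no matter how small the error
    # SmoothL1Loss 0.01 -> gradient shrinks near zero, which keeps updates tame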

Also, I'm experimenting with variants of the implementation. If you're interested in whether unclipped prior boxes affect mAP, I'm testing that too, so I'll post the results here in the comments if that's fine with you.

sgrvinod commented 1 year ago

> Thanks for the fast reply. There's another thing I think you could add to your note: using L1 instead of the smooth version can cause exploding gradients (which happened to you), so you had to choose batch_size = 8. With batch_size = 32 and smooth L1, the loss starts around 24 and converges to ~2.5 (mean) after 120k iterations.
>
> Also, I'm experimenting with variants of the implementation. If you're interested in whether unclipped prior boxes affect mAP, I'm testing that too, so I'll post the results here in the comments if that's fine with you.

Sure, thanks.