kuangliu / pytorch-retinanet

RetinaNet in PyTorch

Update loss.py #49

Open banjuanshua opened 5 years ago

banjuanshua commented 5 years ago

Fix cls_loss NaN bug. The sum of cls_loss should be divided by the total number of pos_neg anchors.
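Roughly, the change amounts to something like the following (simplified sketch of loss.py's forward, not the exact diff; pos, pos_neg, loc_loss and cls_loss follow the repo's naming):

num_pos = pos.sum().item()          # anchors matched to a ground-truth box
num_pos_neg = pos_neg.sum().item()  # all non-ignored anchors (positives + negatives)

# before: both terms were normalized by num_pos
# loss = (loc_loss + cls_loss) / num_pos

# proposed: normalize the classification term by the pos_neg count instead
loss = loc_loss / num_pos + cls_loss / num_pos_neg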

SCoulY commented 5 years ago

Good job

pengfeidip commented 5 years ago

@banjuanshua Have you successfully run the code?

ridhajuneja123 commented 5 years ago

This is also giving NaN loss. How does the above solution solve the problem? Can you explain?

noahcao commented 4 years ago

For best results, you should also add a clamp in focal_loss, in case the sigmoid produces a value whose log is inf:

def focal_loss_alt(self, x, y):
    '''Focal loss alternative.

    Args:
      x: (tensor) predicted class logits, sized [N,D].
      y: (tensor) target labels, sized [N,].

    Return:
      (tensor) focal loss.
    '''
    alpha = 0.25

    t = one_hot_embedding(y.data.cpu(), 1+self.num_classes)
    t = t[:,1:]             # drop the background column
    t = Variable(t).cuda()

    xt = x*(2*t-1)          # xt = x if t > 0 else -x
    pt = (2*xt+1).sigmoid()

    w = alpha*t + (1-alpha)*(1-t)
    logit = pt.clamp(self.eps, 1.-self.eps)  # clamp away from 0/1 so log() stays finite
    loss = -w*logit.log() / 2

    return loss.sum()

This should be more robust; we could set self.eps = 1e-7.

zjysteven commented 4 years ago

Have you trained a satisfactory model with your modification? In my case, dividing cls_loss by pos_neg gives a very small cls_loss (e.g. around 0.0020), since the number of background anchors is so large. However, after training, the classification outputs were all near 0, and I think such a small cls_loss prevented the model from learning.
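To make the scale issue concrete, here is a back-of-envelope illustration (the anchor counts and loss value are made up, just to show the magnitudes):

num_pos, num_pos_neg = 100, 100000  # positives vs. all non-ignored anchors
cls_loss_sum = 200.0                # summed focal loss over those anchors

print(cls_loss_sum / num_pos)       # 2.0   -> a usable training signal
print(cls_loss_sum / num_pos_neg)   # 0.002 -> almost no gradient signal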

Also, the two implementations below both use num_pos as the normalizer, so I don't think this PR is actually helpful. https://github.com/fizyr/keras-retinanet/blob/a8bbd6535655602a84f194b1557be1ed5180fa71/keras_retinanet/losses.py#L61-L65 https://github.com/yhenon/pytorch-retinanet/blob/bb11485e410dba9da3578de5b2cb36d9cd6fdd03/retinanet/losses.py#L92
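The normalization in the linked implementations boils down to roughly this (sketch using this repo's variable names; cls_loss is the summed focal loss over all non-ignored anchors):

# divide the summed classification loss by the number of positive anchors,
# clamped to at least 1 so an image with no objects cannot divide by zero
num_pos = pos.sum().float()
cls_loss = cls_loss / torch.clamp(num_pos, min=1.0)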

twmht commented 4 years ago

@zjysteven

what is the training cls loss when you use only num_pos as the normalizer?

zjysteven commented 4 years ago

@twmht Sorry, it's been a long time, so I can't remember exactly what the value was. But I have confirmed with the authors of several other implementations that it is correct to divide the cls loss by num_pos only. This is also what Facebook's Detectron2 does, so you can be confident doing the same. However, in my own experience this repo may have some other subtle implementation errors, so it is not guaranteed that you will get good results with it.