kuangliu / pytorch-retinanet

RetinaNet in PyTorch

RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor for argument #2 'other' #62

Closed sunshine-zkf closed 5 years ago

sunshine-zkf commented 5 years ago

When I run train.py, I get the following error:

/home/sunshine_zkf/RetinaNet/pytorch-retinanet-master/loss.py:95: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  print('loc_loss: %.3f | cls_loss: %.3f' % (loc_loss.data[0]/num_pos, cls_loss.data[0]/num_peg), end=' | ')
Traceback (most recent call last):
  File "/home/sunshine_zkf/RetinaNet/pytorch-retinanet-master/train.py", line 116, in <module>
    train(epoch)
  File "/home/sunshine_zkf/RetinaNet/pytorch-retinanet-master/train.py", line 77, in train
    loss = criterion(loc_preds, loc_targets, cls_preds, cls_targets)
  File "/home/sunshine_zkf/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sunshine_zkf/RetinaNet/pytorch-retinanet-master/loss.py", line 95, in forward
    print('loc_loss: %.3f | cls_loss: %.3f' % (loc_loss.data[0]/num_pos, cls_loss.data[0]/num_peg), end=' | ')
RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor for argument #2 'other'

Why does this happen? Can you help me? Thank you very much!

sunshine-zkf commented 5 years ago

@kuangliu

wvalcke commented 5 years ago

In utils.py you need to replace the following:

a = torch.arange(0,x)
b = torch.arange(0,y)

with

a = torch.arange(0,x,dtype=torch.float)
b = torch.arange(0,y,dtype=torch.float)

You will probably also need to replace every use of .data[0] with .item()
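
A minimal sketch of the two changes suggested above, assuming PyTorch 0.4 or later and running on CPU. The meshgrid_float helper is a hypothetical stand-in for the grid-building code in utils.py, not the repository's actual function.

```python
import torch

def meshgrid_float(x, y):
    """Hypothetical stand-in for the grid code in utils.py.

    Returns an [x*y, 2] float tensor of (column, row) coordinates.
    """
    # With integer arguments and no dtype, torch.arange yields an int64
    # (Long) tensor; passing dtype=torch.float keeps later float math
    # from mixing Long and Float tensors.
    a = torch.arange(0, x, dtype=torch.float)
    b = torch.arange(0, y, dtype=torch.float)
    xx = a.repeat(y).view(-1, 1)
    yy = b.view(-1, 1).repeat(1, x).view(-1, 1)
    return torch.cat([xx, yy], 1)

grid = meshgrid_float(3, 2)

# A 0-dim tensor (such as a summed loss) converts to a Python number
# with .item() rather than .data[0].
loss = torch.tensor([0.25, 0.5]).sum()
loss_value = loss.item()
```

The explicit dtype matters because the coordinates are later combined with float values, and the PyTorch version in this thread does not mix Long and Float tensors in arithmetic.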

sunshine-zkf commented 5 years ago

I made the changes you suggested, but the error below occurred. I also tried loc_loss.data[0].item() and the other similar calls, and got the same error.

Traceback (most recent call last):
  File "/home/sunshine_zkf/RetinaNet/pytorch-retinanet-master/train.py", line 116, in <module>
    train(epoch)
  File "/home/sunshine_zkf/RetinaNet/pytorch-retinanet-master/train.py", line 77, in train
    loss = criterion(loc_preds, loc_targets, cls_preds, cls_targets)
  File "/home/sunshine_zkf/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sunshine_zkf/RetinaNet/pytorch-retinanet-master/loss.py", line 95, in forward
    print('loc_loss: %.3f | cls_loss: %.3f' % (loc_loss.item()/num_pos, cls_loss.item()/num_peg), end=' | ')
  File "/home/sunshine_zkf/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 320, in __rdiv__
    return self.reciprocal() * other
RuntimeError: reciprocal is not implemented for type torch.cuda.LongTensor

What is the problem? Can you help me? Thank you very much!

wvalcke commented 5 years ago

You did loc_loss.data[0].item()

But it should be loc_loss.item()

Check other references like this and change them all

sunshine-zkf commented 5 years ago

Thank you very much! train.py now runs successfully. The root cause seems to be the PyTorch version. Besides switching to loc_loss.item(), it is also necessary to change the following: num_pos = pos.data.long().sum().item()
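
The combined fix for the logging line can be sketched as follows, assuming PyTorch 0.4 or later. The tensors are made-up stand-ins for the values computed in loss.py; only the .item() conversions mirror the thread.

```python
import torch

# Made-up stand-ins for values computed inside the loss (not the repo's data).
cls_targets = torch.tensor([0, 3, -1, 7, 0])   # > 0 marks a positive anchor
loc_loss = torch.tensor(1.7)                   # 0-dim tensor, like a summed loss
cls_loss = torch.tensor(0.4)

pos = cls_targets > 0
# .item() turns the count into a plain Python int, so the divisions below
# are ordinary Python arithmetic instead of float-by-LongTensor division.
num_pos = pos.data.long().sum().item()

line = 'loc_loss: %.3f | cls_loss: %.3f' % (loc_loss.item() / num_pos,
                                            cls_loss.item() / num_pos)
print(line)
```

Keeping num_pos as a Python number avoids the __rdiv__ path on a LongTensor that raised the "reciprocal is not implemented" error earlier in the thread.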

sunshine-zkf commented 5 years ago

Sorry to disturb you again. Is it normal for the loss to look like this right at the start of training?

loc_loss: 0.085 | cls_loss: 0.001 | train_loss: 0.087 | avg_loss: 0.088
loc_loss: 0.082 | cls_loss: 0.001 | train_loss: 0.083 | avg_loss: 0.088
loc_loss: 0.087 | cls_loss: 0.001 | train_loss: 0.088 | avg_loss: 0.088
loc_loss: 0.082 | cls_loss: 0.001 | train_loss: 0.083 | avg_loss: 0.088
loc_loss: 0.081 | cls_loss: 0.001 | train_loss: 0.083 | avg_loss: 0.088
loc_loss: 0.081 | cls_loss: 0.001 | train_loss: 0.082 | avg_loss: 0.088
loc_loss: 0.090 | cls_loss: 0.001 | train_loss: 0.091 | avg_loss: 0.088
loc_loss: 0.084 | cls_loss: 0.001 | train_loss: 0.085 | avg_loss: 0.088
loc_loss: 0.082 | cls_loss: 0.001 | train_loss: 0.083 | avg_loss: 0.088
loc_loss: 0.085 | cls_loss: 0.001 | train_loss: 0.087 | avg_loss: 0.088
loc_loss: 0.083 | cls_loss: 0.001 | train_loss: 0.085 | avg_loss: 0.088
loc_loss: 0.090 | cls_loss: 0.001 | train_loss: 0.091 | avg_loss: 0.088
loc_loss: 0.084 | cls_loss: 0.001 | train_loss: 0.085 | avg_loss: 0.088
loc_loss: 0.080 | cls_loss: 0.001 | train_loss: 0.081 | avg_loss: 0.088

wvalcke commented 5 years ago

Difficult to say without knowing what you want to train. If possible, send me your train/test index files. What are your training images? Are you training on your own set or an existing one? If you are starting from an already trained model, it can be normal for the loss to be very low at the beginning.

sunshine-zkf commented 5 years ago

I am training on the VOC2012 dataset, matching the files ./data/voc12_train.txt and voc12_val.txt in this repo. I use a net.pth that I downloaded online. So am I starting from an already trained model?

Then I modified loss.py following your #56. The problem got a little better, but it did not make much difference.

wvalcke commented 5 years ago

Have you used the script get_state_dict.py? It initializes net.pth with pretrained ResNet-50 weights (I guess from ImageNet), and the RetinaNet-specific layers are initialized from a Gaussian distribution. The net.pth it creates is not a trained RetinaNet at all. That is what I did, and training (on a specific set of mine) starts with a loss of 2.1, which then decreases during training.

sunshine-zkf commented 5 years ago

Yes, I used the script get_state_dict.py to generate net.pth. Did you train on VOC? How do I know whether the training is going correctly?

wvalcke commented 5 years ago

I started training on the Pascal VOC set; the loss starts at 1.4. But during the first test evaluation it fails to load the test images, and I can't find them. Where did you download them from?

wvalcke commented 5 years ago

I took the loss implementation from issue #52 and started training on VOC. The loss started at 0.7, and training seems more stable than with the original code, where the loss sometimes went to 'nan'.

sunshine-zkf commented 5 years ago

I downloaded the images from the VOC2007 test set, but when I ran test.py there were many boxes on the detected image, so I think there is a problem with that code. How about you? I use the loss from issue #52; the loss is very low, but it is stable. Can I add you on WeChat?

wvalcke commented 5 years ago

I trained on the VOC dataset and saw that with the loss from #52 it trained, but the results were not OK (hundreds of boxes detected). I changed the loss function to the definition below, retrained from scratch, and after training tested on one of the images. Now the objects were correctly detected.

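def focal_loss_alt(self, x, y):
    '''Focal loss alternative.

    Assumes the names already imported in loss.py:
    one_hot_embedding (from utils) and Variable (from torch.autograd).

    Args:
      x: (tensor) sized [N,D].
      y: (tensor) sized [N,].

    Return:
      (tensor) focal loss.
    '''
    alpha = 0.25

    # One-hot encode the labels, then drop column 0 (label 0).
    t = one_hot_embedding(y.data.cpu(), 1+self.num_classes)
    t = t[:,1:]
    t = Variable(t).cuda()

    xt = x*(2*t-1)  # xt = x if t > 0 else -x
    pt = (2*xt+1).sigmoid()
    pt = pt.clamp(1e-7, 1.0)  # keep log() away from 0

    # alpha weights the positives, (1-alpha) the negatives.
    w = alpha*t + (1-alpha)*(1-t)
    loss = -w*pt.log() / 2
    return loss.sum()
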
sunshine-zkf commented 5 years ago

Why is it modified like this? I don't quite understand xt. I also used the author's other repo, torchcv, but I get 20.3 mAP on the VOC2007 test set. Could I see your modified code?
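
One way to read xt in focal_loss_alt is to trace a single logit by hand. The sketch below redoes the per-element arithmetic of that function in plain Python; focal_loss_alt_scalar is a hypothetical helper written for illustration, not code from the repository.

```python
import math

def focal_loss_alt_scalar(x, t, alpha=0.25):
    """Per-element version of focal_loss_alt for one logit x and target t (0 or 1)."""
    # (2*t - 1) is +1 for a positive target and -1 for a negative one, so xt
    # is large whenever the logit agrees with the target.
    xt = x * (2 * t - 1)
    # Sigmoid of (2*xt + 1), clamped to [1e-7, 1.0] so log() never sees 0.
    pt = 1.0 / (1.0 + math.exp(-(2 * xt + 1)))
    pt = min(max(pt, 1e-7), 1.0)
    # alpha weights positives, (1 - alpha) weights negatives.
    w = alpha * t + (1 - alpha) * (1 - t)
    return -w * math.log(pt) / 2

# A confident, correct prediction costs far less than a confident, wrong one.
good = focal_loss_alt_scalar(3.0, 1)    # positive target, positive logit
bad = focal_loss_alt_scalar(-3.0, 1)    # positive target, negative logit
print(good, bad)
```

With t = 1 the logit is used as is, and with t = 0 it is negated, so the same sigmoid-and-log expression covers both the positive and the negative case.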