Open ljtruong opened 6 years ago
Dear @Worulz, don't change the original code. Just note that when you use SSDLoss you must set:
num_classes = Number of Classes in your data set + 1 (For background)
# Example, in your case:
num_classes = 38 # because 37 + 1 = 38
And when you use Focal Loss you must set:
num_classes = Number of Classes in your data set
# Example, in your case:
num_classes = 37 # because you really have 37 object classes
Note that these changes must be applied at both https://github.com/kuangliu/torchcv/blob/6291f3e1e4bbf6467fd6b1e79001d34a59481bb6/examples/ssd/train.py#L91 and https://github.com/kuangliu/torchcv/blob/6291f3e1e4bbf6467fd6b1e79001d34a59481bb6/examples/ssd/train.py#L36. Good luck.
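The rule above can be written as a small helper. This is only a sketch of the advice in this comment: the function name `num_classes_for` and the strings `"ssd"` and `"focal"` are hypothetical and are not part of torchcv.

```python
def num_classes_for(loss_name, n_object_classes):
    """Return the num_classes value to use for the given loss.

    loss_name is a hypothetical tag: "ssd" for SSDLoss, "focal" for Focal Loss.
    n_object_classes counts real object classes only (no background).
    """
    if loss_name == "ssd":
        # SSDLoss expects one extra class for background.
        return n_object_classes + 1
    if loss_name == "focal":
        # Focal Loss uses only the object classes.
        return n_object_classes
    raise ValueError("unknown loss: %r" % (loss_name,))


print(num_classes_for("ssd", 37))    # 38
print(num_classes_for("focal", 37))  # 37
```

Whichever value is chosen, the same number should be used at both of the train.py lines linked above.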
@ahkarami
Thank you for your guidance. I have made the changes: I set num_classes to match my classes, plus 1 for background.
I still get the same error:
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [25,0,0], thread: [748,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [19,0,0], thread: [598,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [25,0,0], thread: [598,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [25,0,0], thread: [599,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
Traceback (most recent call last):
File "train.py", line 122, in <module>
loss, loc_loss, cls_loss = criterion(bbox_preds, boxes, cls_preds, labels)
File "/home/ubuntu/py3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/projects/DeepLearningSSD/lib/model/detector/ssd_loss.py", line 59, in forward
cls_loss[cls_targets<0] = 0 # set ignored loss to 0
RuntimeError: copy_if failed to synchronize: device-side assert triggered
It happens in the SSD loss when the last class appears in my cls_targets. It's very strange; it suggests a class mismatch. Is there a dependency on the number of classes anywhere else?
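The assertion in the log requires every target to satisfy `0 <= cur_target < n_classes`. A minimal sketch of a check that could be run on the targets before they reach the loss is shown below. The helper name `out_of_range_targets` is hypothetical, and the values are illustrative rather than taken from the actual dataset.

```python
def out_of_range_targets(cls_targets, num_classes):
    """Return the sorted distinct target values outside [0, num_classes)."""
    return sorted({t for t in cls_targets if not 0 <= t < num_classes})


# Illustrative values only: targets 0..37 checked against num_classes = 37.
targets = list(range(38))
bad = out_of_range_targets(targets, num_classes=37)
print(bad)  # [37]
```

With real tensors, something like `cls_targets.view(-1).tolist()` could be passed in. A non-empty result would mean that the targets and the `num_classes` given to the loss disagree.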
Dear @Worulz,
Please note that you are using the torchcv/examples/ssd code (i.e., the SSD CNN model example for detection); however, you have created net = FPNSSD512(num_classes=21) (i.e., an FPN model)!
If you use the torchcv/examples/ssd code, then create an SSD model. If you want to use the FPN model, then use the corresponding code in torchcv/examples/fpnssd.
Also note that, I think, the SSD model code is based on PyTorch 0.3 and the FPN model code is based on PyTorch 0.4. You can use both versions of PyTorch on one system, as I have described at https://github.com/ahkarami/Ubuntu-for-Deep-Learning#install-2-different-versions-of-a-package-eg-pytorch-on-a-single-system
@ahkarami thanks for the help. I'll give it another try. I assume I may have made an error when writing my own example scripts.
I'm attempting to train on a new dataset, but I'm having trouble understanding where I should change the number of classes. I've changed it where it is passed to the network, the box_coder, and the multibox loss. I get an error here when I run my network:
https://github.com/kuangliu/torchcv/blob/6291f3e1e4bbf6467fd6b1e79001d34a59481bb6/torchcv/models/ssd/box_coder.py#L88
I've removed the 1 + and was able to continue training, but I'm sure this isn't the correct fix.
I have 37 classes, including background at index 0. What number of classes should I feed into the network?
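The arithmetic behind this question can be sketched as follows. It assumes, based only on the `1 +` at the linked box_coder.py line, that the encoder shifts each dataset label up by one and reserves target 0 for unmatched (background) anchors. The helper names are hypothetical and are not part of torchcv.

```python
def shift_labels(labels, matched_index):
    """Hypothetical stand-in for the encoder's label step.

    matched_index[i] is the index of the object matched to anchor i,
    or a negative value when the anchor matches nothing (background).
    """
    return [1 + labels[j] if j >= 0 else 0 for j in matched_index]


def required_num_classes(dataset_labels):
    """Smallest num_classes that keeps every shifted target in range."""
    # +1 for the shift, +1 because targets start at 0.
    return max(dataset_labels) + 2


print(shift_labels([3, 36], [0, -1, 1]))  # [4, 0, 37]
print(required_num_classes(range(37)))    # 38
print(required_num_classes(range(36)))    # 37
```

Under this assumption, dataset labels 0..36 (with background already at index 0) would need num_classes = 38, while dropping background from the dataset labels so that objects run 0..35 would need num_classes = 37. Whether this matches the actual encoder should be checked against the linked line.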