Training on new dataset

ljtruong commented 6 years ago

I'm attempting to train on a new dataset, but I'm having trouble understanding where I should change the number of classes. I've changed it where the network is constructed, in box_coder, and in the multibox loss. I get an error here when I feed data into my network.

https://github.com/kuangliu/torchcv/blob/6291f3e1e4bbf6467fd6b1e79001d34a59481bb6/torchcv/models/ssd/box_coder.py#L88

I've removed the 1 + and was able to continue training, but I'm sure this isn't the correct fix.

I have 37 classes, including background at index 0. What number of classes should I pass to the network?

ahkarami commented 6 years ago

Dear @Worulz, don't change the original code. Just note that when you use SSDLoss, you must set:

num_classes = Number of Classes in your data set + 1 (For background)
# Example, in your case:
num_classes = 38  # because 37 + 1 = 38

And when you use Focal Loss, you must set:

num_classes = Number of Classes in your data set
# Example, in your case:
num_classes = 37  # because you have really 37 object classes

Note that these changes must be applied at https://github.com/kuangliu/torchcv/blob/6291f3e1e4bbf6467fd6b1e79001d34a59481bb6/examples/ssd/train.py#L91 and https://github.com/kuangliu/torchcv/blob/6291f3e1e4bbf6467fd6b1e79001d34a59481bb6/examples/ssd/train.py#L36. Good luck.
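The rule above can be written as a small helper. This is only a sketch: the function name `num_classes_for_loss` and the strings "ssd" and "focal" are made up for illustration and are not part of torchcv.

```python
def num_classes_for_loss(num_object_classes: int, loss: str) -> int:
    """Return the num_classes value to pass to the model and the loss.

    num_object_classes counts real object classes only (no background).
    The loss names "ssd" and "focal" are hypothetical labels for this sketch.
    """
    if loss == "ssd":
        # SSDLoss: one extra class is reserved for background.
        return num_object_classes + 1
    if loss == "focal":
        # Focal Loss: only the object classes are counted.
        return num_object_classes
    raise ValueError("unknown loss: " + loss)


# A dataset with 20 object classes, such as PASCAL VOC:
print(num_classes_for_loss(20, "ssd"))    # 21
print(num_classes_for_loss(20, "focal"))  # 20
```

The same value would need to be used both where the model is built and where the loss is built, since a mismatch between the two is one way to hit an out-of-range target.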

ljtruong commented 6 years ago

@ahkarami

Thank you for your guidance. I have made the changes: I've set num_classes to my number of classes, plus 1 for background.

Here: https://github.com/kuangliu/torchcv/blob/6291f3e1e4bbf6467fd6b1e79001d34a59481bb6/examples/ssd/train.py#L37

and here: https://github.com/kuangliu/torchcv/blob/6291f3e1e4bbf6467fd6b1e79001d34a59481bb6/examples/ssd/train.py#L91

I still experience the same error.

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [25,0,0], thread: [748,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [19,0,0], thread: [598,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [25,0,0], thread: [598,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [25,0,0], thread: [599,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
Traceback (most recent call last):
  File "train.py", line 122, in <module>
    loss, loc_loss, cls_loss = criterion(bbox_preds, boxes, cls_preds, labels)
  File "/home/ubuntu/py3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/projects/DeepLearningSSD/lib/model/detector/ssd_loss.py", line 59, in forward
    cls_loss[cls_targets<0] = 0  # set ignored loss to 0
RuntimeError: copy_if failed to synchronize: device-side assert triggered

It happens in the SSD loss when cls_targets contains the last class index. It's very weird. It suggests a class mismatch. Is there a dependency on the number of classes anywhere else?
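The assertion in the log, `cur_target >= 0 && cur_target < n_classes`, can be checked by hand before the loss is called. Below is a minimal sketch in plain Python. The helper `out_of_range_targets` and the sample data are hypothetical, and the shift by one is only an assumption about what a "1 +" in the encoding step would do to labels that already include background at index 0.

```python
def out_of_range_targets(cls_targets, num_classes):
    """Return the sorted distinct targets that violate 0 <= t < num_classes."""
    return sorted({t for t in cls_targets if not 0 <= t < num_classes})


# Hypothetical labels: 37 classes, background already at index 0.
labels = list(range(37))

# Assumed encoding step: every label is shifted up by one.
shifted = [1 + t for t in labels]

print(out_of_range_targets(labels, 37))   # []
print(out_of_range_targets(shifted, 37))  # [37]
print(out_of_range_targets(shifted, 38))  # []
```

Under these assumptions, only the last class falls outside the valid range, which would match the symptom described above. With real tensors, comparing cls_targets.min() and cls_targets.max() against num_classes gives the same information.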

ahkarami commented 6 years ago

Dear @Worulz, please note that you are using the torchcv/examples/ssd code (i.e., the SSD model example for detection), yet you have constructed net = FPNSSD512(num_classes=21) (i.e., the FPN model). If you are using the torchcv/examples/ssd code, then create an SSD model; if you want to use the FPN model, then use the corresponding code in torchcv/examples/fpnssd. Also note that, I think, the SSD model code is based on PyTorch 0.3 and the FPN model code is based on PyTorch 0.4. You can use both versions of PyTorch, as I have described in https://github.com/ahkarami/Ubuntu-for-Deep-Learning#install-2-different-versions-of-a-package-eg-pytorch-on-a-single-system

ljtruong commented 6 years ago

@ahkarami thanks for the help. I'll give it another try. I assume I made an error when writing my own example scripts.

kuangliu / torchcv

Training on new dataset #37