lufficc / SSD

High quality, fast, modular reference implementation of SSD in PyTorch
MIT License
1.52k stars 384 forks

Training problem on custom dataset #167

Closed · Vlad15lav closed this 4 years ago

Vlad15lav commented 4 years ago

Hi. I'm training on the DIOR dataset, but I'm running into a problem. The loss looks like this:

    {'reg_loss': tensor(413458.5938, device='cuda:0', grad_fn=), 'cls_loss': tensor(17.1876, device='cuda:0', grad_fn=)}
    {'reg_loss': tensor(413339.7188, device='cuda:0', grad_fn=), 'cls_loss': tensor(18.7562, device='cuda:0', grad_fn=)}
    {'reg_loss': tensor(413171.5938, device='cuda:0', grad_fn=), 'cls_loss': tensor(28.1600, device='cuda:0', grad_fn=)}
    {'reg_loss': tensor(412951.4688, device='cuda:0', grad_fn=), 'cls_loss': tensor(17.6305, device='cuda:0', grad_fn=)}
    {'reg_loss': tensor(412733.2188, device='cuda:0', grad_fn=), 'cls_loss': tensor(22.3218, device='cuda:0', grad_fn=)}
    {'reg_loss': tensor(412515.8438, device='cuda:0', grad_fn=), 'cls_loss': tensor(36.3204, device='cuda:0', grad_fn=)}
    {'reg_loss': tensor(412336.1562, device='cuda:0', grad_fn=), 'cls_loss': tensor(598.9423, device='cuda:0', grad_fn=)}
    {'reg_loss': tensor(412118.0625, device='cuda:0', grad_fn=), 'cls_loss': tensor(8.7820e+08, device='cuda:0', grad_fn=)}
    {'reg_loss': tensor(413883.9688, device='cuda:0', grad_fn=), 'cls_loss': tensor(nan, device='cuda:0', grad_fn=)}
    {'reg_loss': tensor(inf, device='cuda:0', grad_fn=), 'cls_loss': tensor(nan, device='cuda:0', grad_fn=)}
    {'reg_loss': tensor(inf, device='cuda:0', grad_fn=), 'cls_loss': tensor(nan, device='cuda:0', grad_fn=)}
    {'reg_loss': tensor(inf, device='cuda:0', grad_fn=), 'cls_loss': tensor(nan, device='cuda:0', grad_fn=)}
    {'reg_loss': tensor(inf, device='cuda:0', grad_fn=), 'cls_loss': tensor(nan, device='cuda:0', grad_fn=)}

Also, the labels after SSDTargetTransform look like this:

    {'boxes': tensor([[[1.6875e+05, 2.1252e+05, 9.1088e+01, 9.0052e+01],
                       [1.6875e+05, 2.1252e+05, 9.1088e+01, 9.0052e+01],
                       [1.1932e+05, 3.0055e+05, 8.7622e+01, 9.3518e+01],
                       ...,
                       [5.6944e+03, 7.1745e+03, 5.7220e+01, 5.6184e+01],
                       [4.0460e+03, 1.0195e+04, 5.3803e+01, 5.9698e+01],
                       [8.0921e+03, 5.0977e+03, 6.0734e+01, 5.2766e+01]]]),
     'labels': tensor([[8, 0, 0, ..., 0, 0, 0]])}

Is this normal? What could be wrong? Thanks for the answer.

lufficc commented 4 years ago

Have you followed the develop guide? For example, boxes should be in x1, y1, x2, y2 format (absolute values), and images without annotations should be filtered out.
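In other words, a custom dataset needs to hand back absolute-coordinate x1, y1, x2, y2 boxes and drop images with no objects. A minimal sketch of that contract (the `samples` structure and the returned container are illustrative, not the repo's actual code):

    import numpy as np
    from torch.utils.data import Dataset

    class MyDetectionDataset(Dataset):
        """Illustrative custom dataset following the develop guide's contract."""

        def __init__(self, samples, transform=None, target_transform=None):
            # Filter out images without annotations, as the guide requires.
            self.samples = [s for s in samples if len(s["boxes"]) > 0]
            self.transform = transform
            self.target_transform = target_transform

        def __getitem__(self, index):
            sample = self.samples[index]
            image = sample["image"]  # HWC uint8 numpy array
            # Absolute pixel coordinates, in x1, y1, x2, y2 order.
            boxes = np.asarray(sample["boxes"], dtype=np.float32)
            labels = np.asarray(sample["labels"], dtype=np.int64)
            if self.transform is not None:
                image, boxes, labels = self.transform(image, boxes, labels)
            if self.target_transform is not None:
                boxes, labels = self.target_transform(boxes, labels)
            return image, {"boxes": boxes, "labels": labels}, index

        def __len__(self):
            return len(self.samples)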

Vlad15lav commented 4 years ago

Thanks for the answer. Yes, the boxes are in x1, y1, x2, y2 format. I solved part of the problem by using your build_transforms. The labels and images now look like this; there are negative values, and I don't know if that is normal:

    {'boxes': tensor([[[299.1490, 544.6567,  24.0358,  35.8938],
                       [299.1490, 544.6567,  24.0358,  35.8938],
                       [211.5303, 770.2609,  20.5700,  39.3595],
                       ...,
                       [ -2.0578,   6.2445,  -9.8322,   2.0258],
                       [ -1.4621,   8.8738, -13.2496,   5.5398],
                       [ -2.9242,   4.4369,  -6.3182,  -1.3916]]]),
     'labels': tensor([[0, 0, 0, ..., 0, 0, 8]])}

    tensor([[[[ 34.9677,  35.3284,  36.3284,  ..., -56.6716, -71.6716, -66.6716],
              [ 37.1944,  37.3284,  36.3284,  ..., -64.6716, -75.6716, -67.6716],
              [ 35.1944,  35.3284,  35.3284,  ..., -72.6716, -73.6716, -64.6716],
              ...,
              [-44.6716, -40.6716, -41.6716,  ...,  50.3284,  54.3284,  50.3284],
              [-43.6716, -40.6716, -44.6716,  ...,  25.8337,  40.8337,  56.8337],
              [-44.6716, -40.6716, -47.6716,  ...,  18.8337,  25.8337,  41.8337]],

             [[ 36.3284,  38.3284,  41.3284,  ..., -48.6716, -63.6716, -58.6716],
              [ 38.3284,  40.3284,  41.3284,  ..., -55.6716, -66.6716, -59.6716],
              [ 36.3284,  38.3284,  41.3284,  ..., -61.6716, -64.6716, -55.6716],
              ...,
              [-19.6716, -15.6716, -16.6716,  ...,  55.3284,  58.3284,  55.3284],
              [-18.6716, -15.6716, -19.6716,  ...,  33.3284,  48.3284,  64.3284],
              [-19.6716, -15.6716, -22.6716,  ...,  26.3284,  33.3284,  49.3284]],

             [[ 39.3284,  41.6892,  45.1428,  ..., -48.1768, -64.1768, -59.1768],
              [ 42.3284,  43.6892,  45.1428,  ..., -56.9500, -67.9501, -60.1768],
              [ 40.3284,  41.6892,  44.3696,  ..., -64.4965, -64.9501, -55.9501],
              ...,
              [-29.3213, -25.3213, -26.3213,  ...,  57.1428,  61.9160,  57.1428],
              [-28.3213, -25.3213, -29.3213,  ...,  31.3284,  46.3284,  62.3284],
              [-29.3213, -25.3213, -32.3213,  ...,  24.3284,  31.3284,  47.3284]]]])
priteshgohil commented 4 years ago

> boxes should be in x1, y1, x2, y2 format

@lufficc May I know why the ToPercentCoords() transform is not applied at inference? I used it to get a validation loss because I was facing a similar issue. Will it have any impact on other processing blocks if I normalize coordinates for the test dataset?

Vlad15lav commented 4 years ago

@lufficc @priteshgohil Can the values of targets['boxes'] be negative? I checked the min and max: -1193.6273193359375 and 1193.079833984375. My problem is that total_loss becomes nan after 1000 iterations. I have also checked the annotations (xmax > xmin and ymax > ymin). I am training on optical remote sensing images with small objects, using an input size of 512 and MIN_SIZES/MAX_SIZES in my config chosen with k-means.

lufficc commented 4 years ago

> boxes should be in x1, y1, x2, y2 format
>
> @lufficc May I know why the ToPercentCoords() transform is not applied at inference? I used it to get a validation loss because I was facing a similar issue. Will it have any impact on other processing blocks if I normalize coordinates for the test dataset?

Because we don't need box information during inference. For convenience and speed, ToPercentCoords() is not applied then. It won't impact other blocks.
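For reference, the transform in question essentially just divides box coordinates by the image size; a sketch of the idea (see the repo's transforms for the exact code):

    class ToPercentCoords:
        """Convert absolute x1, y1, x2, y2 boxes to 0-1 percent coordinates."""

        def __call__(self, image, boxes=None, labels=None):
            height, width = image.shape[:2]
            boxes[:, 0] /= width   # x1
            boxes[:, 2] /= width   # x2
            boxes[:, 1] /= height  # y1
            boxes[:, 3] /= height  # y2
            return image, boxes, labels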

lufficc commented 4 years ago

> @lufficc @priteshgohil Can the values of targets['boxes'] be negative? I checked the min and max: -1193.6273193359375 and 1193.079833984375. My problem is that total_loss becomes nan after 1000 iterations. I have also checked the annotations (xmax > xmin and ymax > ymin). I am training on optical remote sensing images with small objects, using an input size of 512 and MIN_SIZES/MAX_SIZES in my config chosen with k-means.

The values of targets['boxes'] cannot be negative. In your dataset, box values are absolute values, but when the loss is computed they are normalized to 0-1 (this is done by the ToPercentCoords() transform).
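A quick sanity check over the raw (pre-transform) dataset can catch such boxes before training; a sketch, assuming the dataset yields absolute x1, y1, x2, y2 boxes as described above:

    import numpy as np

    def check_annotations(dataset):
        """Flag boxes that are negative, degenerate, or outside the image."""
        for i in range(len(dataset)):
            image, targets, _ = dataset[i]
            h, w = image.shape[:2]
            boxes = np.asarray(targets["boxes"], dtype=np.float32)
            assert (boxes >= 0).all(), f"negative coordinates in sample {i}"
            assert (boxes[:, 2] > boxes[:, 0]).all(), f"x2 <= x1 in sample {i}"
            assert (boxes[:, 3] > boxes[:, 1]).all(), f"y2 <= y1 in sample {i}"
            assert (boxes[:, [0, 2]] <= w).all() and (boxes[:, [1, 3]] <= h).all(), \
                f"box outside image in sample {i}"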

Vlad15lav commented 4 years ago

I solved the problem by setting the learning rate to 1e-4. Thanks.
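For anyone else hitting exploding losses, the learning rate can be lowered through the yacs config; a sketch, assuming the repo's `ssd.config.cfg` object and the `SOLVER.LR` key from the default configs:

    from ssd.config import cfg  # repo's yacs config object (import path assumed)

    cfg.merge_from_file("configs/vgg_ssd512_voc0712.yaml")
    cfg.merge_from_list(["SOLVER.LR", 1e-4])  # lower LR to keep losses from exploding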

Shakesbeer333 commented 3 years ago

@lufficc @Vlad15lav @priteshgohil

> The values of targets['boxes'] cannot be negative. In your dataset, box values are absolute values, but when the loss is computed they are normalized to 0-1 (this is done by the ToPercentCoords() transform).

Negative values occur because of a wrong implementation of RandomMirror(); the cause is a mix-up of height and width. You can reproduce the behavior by testing the augmentation with imgaug.augmenters.Fliplr() and imgaug.augmentables.bbs.BoundingBoxesOnImage. The latter takes the same x1, y1, x2, y2 bbox format as well as the image shape (H, W, [C]) as input. If you mix up height and width, you get (wrong) negative values that are identical to the ones calculated by RandomMirror(). You can verify the results by displaying the image with its labels (e.g. using pycocotools).
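A sketch of that reproduction (box values are illustrative): with the shape passed correctly as (H, W) the flipped box stays inside the image, while swapping it to (W, H) yields the negative coordinates described above.

    import imgaug.augmenters as iaa
    from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

    flip = iaa.Fliplr(1.0)  # always mirror horizontally
    box = BoundingBox(x1=350, y1=20, x2=450, y2=120)

    # Correct: shape is (H, W) -> the flip uses the true image width of 500.
    good = flip.augment_bounding_boxes(BoundingBoxesOnImage([box], shape=(300, 500)))
    print(good.bounding_boxes[0])  # x1=50.0, x2=150.0

    # Height/width mixed up: shape passed as (W, H) -> the flip uses 300 as
    # the width, producing wrong, negative coordinates.
    bad = flip.augment_bounding_boxes(BoundingBoxesOnImage([box], shape=(500, 300)))
    print(bad.bounding_boxes[0])   # x1=-150.0, x2=-50.0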

priteshgohil commented 3 years ago

@Shakesbeer333 I don't think so. Initially I also had doubts, but I have verified the RandomMirror() function. See the code and images below.

Note: ignore the RGB vs. BGR channel order difference in the images.

[screenshot: verification code and flipped image/box visualizations]
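Since the verification itself was posted as a screenshot, here is a sketch of an equivalent check (the import path is assumed from the repo layout; RandomMirror flips with probability 0.5, so seed the RNG or repeat the call):

    import numpy as np
    from ssd.data.transforms.transform import RandomMirror  # import path assumed

    image = np.zeros((300, 500, 3), dtype=np.uint8)   # H=300, W=500
    boxes = np.array([[350.0, 20.0, 450.0, 120.0]])   # absolute x1, y1, x2, y2
    labels = np.array([1])

    # When a flip happens, the box should mirror around the true width (500):
    # expected x1 = 500 - 450 = 50, x2 = 500 - 350 = 150 -- never negative.
    _, m_boxes, _ = RandomMirror()(image, boxes.copy(), labels)
    print(m_boxes)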