ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Multi-scale incompatibility with Rectangular Training #358

Closed cmiller2air closed 5 years ago

cmiller2air commented 5 years ago

Describe the bug
I used 1024×288 images for training on four 1080 Ti GPUs. I tried to enable both rectangular training and multi-scale learning, but a tensor size mismatch occurred in the route modules. The error seems to occur because the input image dimensions are not multiples of 32. The images are initially resized and padded so that their dimensions are multiples of 32, but the multi-scale module's resizing step then moves them off that grid. Swapping the order of the two steps may fix the problem; alternatively, scaling the image to explicit target width and height values, rather than applying a single resize factor, would also help.
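To illustrate why a dimension off the 32-grid breaks the route layers: the stride-32 feature map is upsampled 2x and concatenated with the stride-16 feature map, and their sizes only agree when the input dimension is a multiple of 32. A minimal sketch with hypothetical sizes (assuming Darknet's stride-2 convolutions with kernel 3 and padding 1, which round up when halving an odd size):

```python
import math

def grid(h, n_halvings):
    # Each stride-2 conv (k=3, p=1) halves the size, rounding up on odd inputs
    for _ in range(n_halvings):
        h = math.ceil(h / 2)
    return h

for h in (864, 872):        # 864 = 27*32; 872 is off the 32-grid
    p16 = grid(h, 4)        # stride-16 feature-map height
    p32 = grid(h, 5)        # stride-32 feature-map height
    print(h, p16, 2 * p32)  # torch.cat at the route layer fails when these differ
# 864 -> 54 vs 54 (ok); 872 -> 55 vs 56 (size mismatch)
```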

glenn-jocher commented 5 years ago

@cmiller2air let's see. Rectangular training sorts all of your training images by aspect ratio and then groups images with similar aspect ratios into a single batch. It then pads all of the images in each batch as necessary to achieve the minimum padding, subject to the constraint that the image dimensions must be multiples of 32.
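For illustration, a rough sketch of that shape selection (a hypothetical helper, not the repo's exact code):

```python
import numpy as np

def rect_batch_shapes(wh, batch_size=16, img_size=416, stride=32):
    """Sketch: sort images by aspect ratio, then give each batch the
    smallest stride-multiple shape that covers every image in it."""
    wh = np.array(wh, dtype=float)            # (n, 2) image (width, height)
    ar = wh[:, 1] / wh[:, 0]                  # aspect ratio = h / w
    order = ar.argsort()                      # group similar ratios together
    ar = ar[order]

    shapes = []
    for b in range(int(np.ceil(len(ar) / batch_size))):
        batch_ar = ar[b * batch_size:(b + 1) * batch_size]
        lo, hi = batch_ar.min(), batch_ar.max()
        shape = [1.0, 1.0]                    # unit (h, w) shape for the batch
        if hi < 1:
            shape = [hi, 1.0]                 # all images wide: shrink height
        elif lo > 1:
            shape = [1.0, 1.0 / lo]           # all images tall: shrink width
        # scale to img_size and round up to the nearest multiple of stride
        shapes.append((np.ceil(np.array(shape) * img_size / stride) * stride).astype(int))
    return order, shapes

order, shapes = rect_batch_shapes([(1024, 288)] * 16, img_size=416)
print(shapes[0])  # [128 416] -> minimal padding for 1024x288 at img_size 416
```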

I'm not sure whether rectangular training and multi-scale training are mutually exclusive, though one important caveat with rectangular training is that you need to make sure the images are not shuffled by the data loader.

test.py uses rectangular images by default in this repo, so the rectangular functionality operates properly by itself. Can you post your command and its screen output?

glenn-jocher commented 5 years ago

I took a look at this. It seems the interpolation operation in train.py resizes the long side to a new multiple of 32, but the resized image no longer satisfies the 32-multiple constraint on the shorter side. This is what is causing the errors.

https://github.com/ultralytics/yolov3/blob/ab141fcc1ff976fa9d1bd983e11fe43ba4628e2e/train.py#L199-L206

This will need some significant additional logic to correct. We will leave this issue open. I can't give you a timeline for a fix, but an immediate workaround is to use rectangular training with --single-scale.
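To sketch what that additional logic might look like (an illustration only, not the implemented fix; `multiscale_resize` is a made-up helper): once multi-scale picks a new long-side size, compute both target dimensions explicitly and round each up to a multiple of 32, instead of applying one scale factor that can leave the short side off-grid.

```python
import math
import torch
import torch.nn.functional as F

def multiscale_resize(imgs, new_long, stride=32):
    # Snap BOTH dimensions to the 32-grid after scaling, so rectangular
    # batches stay compatible with the route/concat layers.
    h, w = imgs.shape[-2:]
    scale = new_long / max(h, w)
    new_h = math.ceil(h * scale / stride) * stride
    new_w = math.ceil(w * scale / stride) * stride
    return F.interpolate(imgs, size=(new_h, new_w), mode='bilinear',
                         align_corners=False)

imgs = torch.zeros(4, 3, 288, 1024)        # rectangular batch
print(multiscale_resize(imgs, 896).shape)  # torch.Size([4, 3, 256, 896])
```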

Bienqq commented 5 years ago

I have experienced the same problem. I cannot use the --rect flag together with --multi-scale; when I try, I get the following error:

Traceback (most recent call last):
  File "train.py", line 334, in <module>
    accumulate=opt.accumulate)
  File "train.py", line 207, in train
    pred = model(imgs)
  File "/home/tomekb/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tomekb/pytorch_yolov3/models.py", line 189, in forward
    x = torch.cat([layer_outputs[i] for i in layer_i], 1)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 27 and 28 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71

glenn-jocher commented 5 years ago

@Bienqq @cmiller2air since we have now had multiple requests, we've elevated this issue's priority and implemented a fix in https://github.com/ultralytics/yolov3/commit/44b340321fef16ee21e9fba43caa7ddebef03c2f. Multi-scale training should now be compatible with rectangular training.

Please git pull and try again. Thank you!

glenn-jocher commented 5 years ago

Everything appears to be operating correctly in our COCO tests:

python3 train.py --data data/coco.data --img-size 320 --rect --multi-scale

Namespace(accumulate=4, batch_size=16, bucket='', cfg='cfg/yolov3-spp.cfg', data='data/coco.data', epochs=100, evolve=False, img_size=320, multi_scale=True, nosave=False, notest=False, num_workers=4, rect=True, resume=False, transfer=False, var=0, xywh=False)
Using CUDA with Apex device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15079MB)

Reading image shapes: 100% 117263/117263 [03:24<00:00, 572.30it/s]
Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients

     Epoch   gpu_mem   GIoU/xy        wh       obj       cls     total   targets  img_size
      0/99      3.7G     0.565         0      10.6      13.2      24.4        93       352:   4% 297/7329 [01:55<33:54,  3.46it/s]

glenn-jocher commented 5 years ago

Closing as issue should be resolved.