OpenCV error upon mAP calculation

bigrobinson commented 5 years ago

First of all, thank you for the hard work and the great documentation. Can you help with this error? It occurs upon calculation of the mAP and appears related to an opencv rendering. I have 4 classes I am training with 1720 labeled training images. Any help is much appreciated.

*** cfg file hyperparameters
[net]
batch=64
subdivisions=8
width=608
height=608
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.001
burn_in=2000
max_batches=500200
policy=steps
steps=400000,450000
scales=.1,.1

** cfg file pre-yolo layer parameters
[convolutional]
size=1
stride=1
pad=1
filters=27
activation=linear

****Error out

OpenCV(3.4.1) Error: Assertion failed (top >= 0 && bottom >= 0 && left >= 0 && right >= 0) in copyMakeBorder, file /opt/conda/conda-bld/opencv-suite_1527005194613/work/modules/core/src/copy.cpp, line 1182
Traceback (most recent call last):
  File "train.py", line 347, in <module>
    accumulate=opt.accumulate,
  File "train.py", line 260, in train
    conf_thres=0.1)
  File "/home/brian/yolov3/test.py", line 60, in test
    for batch_i, (imgs, targets, paths, shapes) in enumerate(tqdm(dataloader, desc='Computing mAP')):
  File "/home/brian/anaconda3/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1005, in __iter__
    for obj in iterable:
  File "/home/brian/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/brian/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
cv2.error: Traceback (most recent call last):
  File "/home/brian/anaconda3/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/brian/anaconda3/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/brian/yolov3/utils/datasets.py", line 269, in __getitem__
    img, ratio, padw, padh = letterbox(img, new_shape=shape, mode='rect')
  File "/home/brian/yolov3/utils/datasets.py", line 359, in letterbox
    img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # padded square
cv2.error: OpenCV(3.4.1) /opt/conda/conda-bld/opencv-suite_1527005194613/work/modules/core/src/copy.cpp:1182: error: (-215) top >= 0 && bottom >= 0 && left >= 0 && right >= 0 in function copyMakeBorder

glenn-jocher commented 5 years ago

Hello, thank you for your interest in our work! Please note that most technical problems are due to:

Your changes to the default repository. If your issue is not reproducible in a fresh git clone of this repository we cannot debug it. Before going further, run this code and ensure your issue persists:

sudo rm -rf yolov3  # remove existing repo
git clone https://github.com/ultralytics/yolov3 && cd yolov3 # git clone latest
python3 detect.py  # verify detection
python3 train.py  # verify training (a few batches only)
# CODE TO REPRODUCE YOUR ISSUE HERE

Your custom data. If your issue is not reproducible with COCO data we cannot debug it. Visit our Custom Training Tutorial for exact details on how to format your custom data. Examine train_batch0.jpg and test_batch0.jpg for a sanity check of training and testing data.
Your environment. If your issue is not reproducible in a GCP Quickstart Guide VM we cannot debug it. Ensure you meet the requirements specified in the README: Unix, macOS, or Windows with Python >= 3.7, PyTorch >= 1.1, etc. You can also use our Google Colab Notebook to test your code in a working environment.

In your case it looks like the OpenCV error occurs in test.py when attempting to plot test_batch0.jpg, which shows you the training data with the labels. It's likely there is an error with your training data somewhere. You can run a validated working example like this and try to debug from there:

python3 train.py --data data/coco_16img.data --epochs 1

Namespace(accumulate=8, backend='nccl', batch_size=8, cfg='cfg/yolov3-spp.cfg', data_cfg='data/coco_16img.data', dist_url='tcp://127.0.0.1:9999', epochs=1, evolve=False, giou=False, img_size=320, nosave=False, notest=False, num_workers=4, rank=0, resume=False, single_scale=False, transfer=False, var=0, world_size=1)
Using CPU

Reading labels: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 10275.43it/s]
Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients

   Epoch       Batch        xy        wh      conf       cls     total   targets      time
     0/0         0/1     0.418      0.67      26.3      3.73      31.1        45      13.8
     0/0         1/1     0.383     0.552      26.2      3.72      30.9        27      13.6
1 epochs completed in 0.008 hours.
Reading labels: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 12363.46it/s]
               Class    Images   Targets         P         R       mAP        F1
Computing mAP: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.57s/it]
                 all        16        76         0         0         0         0
              person        16        16         0         0         0         0
                 car        16        14         0         0         0         0
          motorcycle        16         1         0         0         0         0
            airplane        16         1         0         0         0         0
               train        16         1         0         0         0         0
               truck        16         3         0         0         0         0
           stop sign        16         1         0         0         0         0
               horse        16         2         0         0         0         0
            elephant        16         2         0         0         0         0
               zebra        16         1         0         0         0         0
             giraffe        16         4         0         0         0         0
            umbrella        16         1         0         0         0         0
             handbag        16         1         0         0         0         0
          skateboard        16         3         0         0         0         0
                fork        16         1         0         0         0         0
               knife        16         5         0         0         0         0
                bowl        16         3         0         0         0         0
              orange        16         4         0         0         0         0
            broccoli        16         1         0         0         0         0
                cake        16         1         0         0         0         0
        potted plant        16         2         0         0         0         0
           microwave        16         1         0         0         0         0
                oven        16         1         0         0         0         0
                book        16         3         0         0         0         0
               clock        16         2         0         0         0         0
                vase        16         1         0         0         0         0

train_batch0.jpg

test_batch0.jpg

bigrobinson commented 5 years ago

Thanks for the prompt response. I ran the working example and it went off flawlessly. Let me ask you this: I am using multiple sensors to record my training data. What effect will it have if my training images have different aspect ratios and resolutions?

glenn-jocher commented 5 years ago

@bigrobinson it shouldn't have any effect. The images in COCO, for example, come from various devices and resolutions. Sourcing data from a variety of places actually improves the generalization ability of the network in real-world scenarios.

Typically custom training problems are due to formatting issues. Make sure that your training data is labelled and structured the exact same way as the coco dataset.
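A quick way to catch formatting issues of this kind is to validate the label rows before training. The sketch below assumes darknet-style label files, where each row is `class x_center y_center width height` with coordinates normalized to [0, 1]; that format, the function name `check_label_rows`, and the sample rows are assumptions for illustration, not code from the repository.

```python
def check_label_rows(rows, num_classes):
    """Return (row_index, problem) pairs for rows that break the assumed format.

    Assumed format (darknet-style): "class x_center y_center width height",
    with the four box values normalized to the range [0, 1].
    """
    problems = []
    for i, row in enumerate(rows):
        parts = row.split()
        if len(parts) != 5:
            problems.append((i, "expected 5 fields"))
            continue
        try:
            cls = int(parts[0])
            box = [float(v) for v in parts[1:]]
        except ValueError:
            problems.append((i, "non-numeric field"))
            continue
        if not 0 <= cls < num_classes:
            problems.append((i, "class index out of range"))
        if any(not 0.0 <= v <= 1.0 for v in box):
            problems.append((i, "coordinate outside [0, 1]"))
    return problems


# Hypothetical rows for a 4-class dataset; only the first one is well formed.
sample_rows = [
    "0 0.5 0.5 0.2 0.3",
    "4 0.5 0.5 0.2 0.3",   # class 4 does not exist when there are 4 classes (0..3)
    "1 1.2 0.5 0.2 0.3",   # x_center is not normalized
    "2 0.5 0.5",           # truncated row
]
print(check_label_rows(sample_rows, num_classes=4))
```

Running a check like this over every label file narrows a data problem down to a specific file and row.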

bigrobinson commented 5 years ago

The error is due to calculation of negative border dimensions in the letterbox method of datasets.py causing an exception to be thrown by the call to cv2.copyMakeBorder, when rectangular training is set to TRUE. Note that rectangular training is set to FALSE by default in train.py:

# Dataset
rectangular_training = False
dataset = LoadImagesAndLabels(train_path,
                              img_size,
                              batch_size,
                              augment=True,
                              rect=rectangular_training)

Whereas it is set to TRUE by default in test.py (which is called by train.py when called as main):

dataset = LoadImagesAndLabels(test_path, img_size, batch_size)  # Note rect=True by default
dataloader = DataLoader(dataset,
                        batch_size=batch_size,
                        num_workers=4,
                        pin_memory=True,
                        collate_fn=dataset.collate_fn)

When I set rect=False in the call to LoadImagesAndLabels in test.py, the problem is resolved. On the other hand, when I set rectangular training to TRUE in train.py, the problem appears again when cv2.copyMakeBorder is called. It appears to be a bug in the letterbox padding calculation.
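To illustrate how a letterbox-style calculation can produce negative borders, here is a minimal sketch. It is not the code from datasets.py: `letterbox_padding` and its `rule` argument are hypothetical, and the two scaling rules are assumptions chosen to show the failure mode. Scaling by the longest side can leave the resized image larger than a rectangular target in the other dimension, whereas scaling by the tighter of the two per-axis ratios cannot.

```python
def letterbox_padding(img_hw, target_hw, rule="min"):
    """Compute (top, bottom, left, right) borders for fitting an image in a target.

    rule="max_side": scale so the longest image side equals the longest target
                     side (can overflow a rectangular target).
    rule="min":      scale by the smaller per-axis ratio so the resized image
                     always fits inside the target.
    """
    h, w = img_hw
    th, tw = target_hw
    if rule == "max_side":
        r = max(th, tw) / max(h, w)
    else:
        r = min(th / h, tw / w)
    new_h, new_w = int(round(h * r)), int(round(w * r))
    pad_h, pad_w = th - new_h, tw - new_w   # total padding per axis
    top, left = pad_h // 2, pad_w // 2      # split as evenly as integers allow
    return top, pad_h - top, left, pad_w - left


# A square 400x400 image placed into a 320x416 (height x width) rectangular batch shape.
print(letterbox_padding((400, 400), (320, 416), rule="max_side"))  # negative top/bottom
print(letterbox_padding((400, 400), (320, 416), rule="min"))       # all borders >= 0
```

Negative values like those in the first call are exactly what cv2.copyMakeBorder rejects with the assertion `top >= 0 && bottom >= 0 && left >= 0 && right >= 0`.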

glenn-jocher commented 5 years ago

@bigrobinson yes, rectangular training is still very much in development, and is not currently compatible with multi-scale training for example. If you set rect=False in both data loaders this should resolve your issue.

If you can more clearly determine the exact size of the images and the padding that is causing the issue this might help also.

sanazss commented 5 years ago

My test data is only 4 images and I get 0 mAP for a single-class problem. My test batch matches the validation set, but I am getting an mAP of 0, and the other metrics (P, R and F1) don't make sense.

glenn-jocher commented 5 years ago

@sanazss this indicates that no objects were detected above threshold in your test set.
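As a rough illustration of why every metric collapses to zero in that case, the sketch below computes precision, recall and F1 at a single confidence threshold. It is a simplification and not the mAP computation in test.py; the function name, the `(confidence, is_true_positive)` input layout and the sample numbers are assumptions.

```python
def precision_recall_f1(detections, n_targets, conf_thres=0.1):
    """Precision, recall and F1 over detections kept at a confidence threshold.

    detections: list of (confidence, is_true_positive) pairs (assumed layout).
    n_targets:  number of ground-truth objects in the test set.
    """
    kept = [is_tp for conf, is_tp in detections if conf >= conf_thres]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / n_targets if n_targets else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Every detection falls below the threshold, so nothing is kept and all metrics are 0.
print(precision_recall_f1([(0.05, True), (0.02, False)], n_targets=4))

# With confident detections the metrics become non-zero.
print(precision_recall_f1([(0.9, True), (0.8, False), (0.05, True)], n_targets=4))
```

If the first situation matches your run, the usual suspects are too little training or a label/class mismatch rather than the metric code itself.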

bigrobinson commented 5 years ago

Hey @glenn-jocher I was in Germany for a couple of weeks and was going to take a look at the letterbox routine now that I'm back. I see you have made a lot of changes to the code base since then. Looking good. I have re-synced to master and re-run training with my data and with rect=True. I am no longer getting the negative border error I was getting before. Cheers!

glenn-jocher commented 5 years ago

@bigrobinson yes we've been busy with updates. In general you should see better results now, and rectangular training is now compatible with multiscale.

glenn-jocher commented 5 years ago

@bigrobinson I'll go ahead and close this now since the issue seems resolved.

Pari-singh commented 5 years ago

Hi @glenn-jocher I am facing this issue with the updated repository. When I run the 'Reproduce our results' example from https://docs.ultralytics.com/yolov5/tutorials/train_custom_data, it seems to run perfectly. However, when I create a new 3-class dataset from the COCO dataset, with labels for only those 3 classes, and change the config file accordingly, I get this error after the 10th epoch:

10/329 1.57G 5.68 27.5 33 66.2 5 416: 88%|▉| 7/8 [00:01<00:00, 3
10/329 1.57G 5.22 27.3 30.7 63.2 18 416: 88%|▉| 7/8 [00:02<00:00, 3
10/329 1.57G 5.22 27.3 30.7 63.2 18 416: 100%|█| 8/8 [00:02<00:00, 4
10/329 1.57G 5.22 27.3 30.7 63.2 18 416: 100%|█| 8/8 [00:02<00:00, 3.74it/s]
Class Images Targets P R mAP@0.5 F1: 0%| | 0/8 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 429, in <module>
    train()  # train normally
  File "train.py", line 309, in train
    save_json=final_epoch and epoch > 0 and 'coco.data' in data)
  File "/home/paridhi/tci/yolov3/test.py", line 63, in test
    for batch_i, (imgs, targets, paths, shapes) in enumerate(tqdm(dataloader, desc=s)):
  File "/home/paridhi/my_project/lib/python3.5/site-packages/tqdm/std.py", line 1091, in __iter__
    for obj in iterable:
  File "/home/paridhi/my_project/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/home/paridhi/my_project/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/paridhi/my_project/lib/python3.5/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
cv2.error: Caught error in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/paridhi/my_project/lib/python3.5/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/paridhi/my_project/lib/python3.5/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/paridhi/my_project/lib/python3.5/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/paridhi/tci/yolov3/utils/datasets.py", line 426, in __getitem__
    img, ratio, padw, padh = letterbox(img, self.batch_shapes[self.batch[index]], mode='rect')
  File "/home/paridhi/tci/yolov3/utils/datasets.py", line 642, in letterbox
    img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # add border
cv2.error: OpenCV(4.1.1) /io/opencv/modules/core/src/copy.cpp:1196: error: (-215:Assertion failed) top >= 0 && bottom >= 0 && left >= 0 && right >= 0 && _src.dims() <= 2 in function 'copyMakeBorder'

glenn-jocher commented 5 years ago

@Pari-singh there's probably something wrong with one of your images then. You should run in debug mode so you can capture the values being passed to the cv2 function and try to find the image responsible.

We routinely train on custom datasets for our clients without issue.
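One way to find the image responsible without stepping through a debugger is to run the per-image preprocessing over the whole dataset and record which items raise. The helper below is a generic sketch: `find_failing_items` is a hypothetical name, and `process` stands in for whatever loads an image and applies the letterbox step in your copy of the code.

```python
def find_failing_items(items, process):
    """Call process(item) for every item and collect the ones that raise.

    Returns a list of (item, exception_type_name, message) tuples so the
    offending inputs can be inspected afterwards.
    """
    failures = []
    for item in items:
        try:
            process(item)
        except Exception as exc:  # broad on purpose: we want every failure listed
            failures.append((item, type(exc).__name__, str(exc)))
    return failures


# Stand-in for the real load-and-letterbox step (assumption for illustration only).
def fake_process(path):
    if "bad" in path:
        raise ValueError("negative padding for " + path)


print(find_failing_items(["img_001.jpg", "bad_002.jpg", "img_003.jpg"], fake_process))
```

With the real preprocessing plugged in, the returned list names the files to open and inspect, along with the error each one triggered.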

akbari59 commented 4 years ago

Hello all, has anyone found a solution for this problem, or any hint as to what the problem might be? I have the same problem when training on a custom dataset.

glenn-jocher commented 1 year ago

@akbari59 as mentioned earlier, you should run the training in debug mode to capture the values being passed to the cv2 function and try to identify the problematic image. This will likely help pinpoint the issue.

ultralytics / yolov3

OpenCV error upon mAP calculation #343