ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0
10.16k stars 3.44k forks source link

test.py error upon 'negative examples' (completely empty batches) #414

Closed bigrobinson closed 4 years ago

bigrobinson commented 5 years ago

Description of the bug I have four classes along with negative examples, with training set about evenly split with 4000 images total, and test block also evenly split with about 500 images. (Train/test/validation split is 80/10/10 all cases.) If I exclude negative examples, it trains just fine and will calculate mAP in every epoch. When I include the negative examples, it trains but it will not calculate mAP without giving the error below:

Screen output

Namespace(accumulate=2, batch_size=32, bucket='', cfg='/home/brian/yolov3_ultralytics/data/RawTrainingData/yolov3-spp_uxo.cfg', data='/home/brian/yolov3_ultralytics/data/RawTrainingData/Uxo.data', epochs=273, evolve=False, img_size=416, img_weights=False, multi_scale=False, nosave=False, notest=False, num_workers=4, rect=False, resume=False, transfer=False, xywh=False)
Using CUDA device0 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11175MB)
           device1 _CudaDeviceProperties(name='GeForce GTX 1080', total_memory=8119MB)

Model Summary: 225 layers, 6.25895e+07 parameters, 6.25895e+07 gradients

     Epoch   gpu_mem   GIoU/xy        wh       obj       cls     total   targets  img_size
     0/272     2.42G      1.23         0      1.49      17.4      20.1         5       416: 100%|████████████████████████████████████████████████████████████████████████████████| 127/127 [11:59<00:00,  4.17s/it]
Reading image shapes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 502/502 [00:07<00:00, 65.69it/s]
                         Class    Images   Targets         P         R       mAP        F1:  75%|█████████████████████████████████████████████████████████████▌                    | 12/16 [01:20<00:16,  4.03s/it]Traceback (most recent call last):
  File "/home/brian/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/home/brian/anaconda3/lib/python3.7/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/brian/anaconda3/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/brian/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 18316) is killed by signal: Killed. 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 362, in <module>
    accumulate=opt.accumulate)
  File "train.py", line 288, in train
    conf_thres=0.1)
  File "/home/brian/yolov3/test.py", line 61, in test
    for batch_i, (imgs, targets, paths, shapes) in enumerate(tqdm(dataloader, desc=s)):
  File "/home/brian/anaconda3/lib/python3.7/site-packages/tqdm/_tqdm.py", line 1005, in __iter__
    for obj in iterable:
  File "/home/brian/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 576, in __next__
    idx, batch = self._get_batch()
  File "/home/brian/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 543, in _get_batch
    success, data = self._try_get_batch()
  File "/home/brian/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 519, in _try_get_batch
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 18316) exited unexpectedly

Steps to reproduce the behavior:

  1. Divide data into train and test directories and assemble .data text files using custom python module
  2. Execute train.py with above switches set from root directory
  3. Wait for first epoch to finish and mAP calculation to start
  4. See error around time when negative examples are reached (test data does not appear to be shuffled)

Expected behavior I expect it to train for 272 epochs and track the mAP for each epoch

Desktop:

glenn-jocher commented 5 years ago

@bigrobinson by negative examples you mean images with no labels? Do you have entire batches of images with no labels?

We've trained with empty images, but not whole batches of empty images I believe.

No you are correct, the test set is not shuffled, as it would not affect the result.

bigrobinson commented 5 years ago

@glenn-jocher yes I do mean images with no labels, and yes I have entire batches of images with no labels.

glenn-jocher commented 5 years ago

@bigrobinson this is a new use case we haven't seen before. If you managed to debug your issue please submit a PR. Otherwise we will leave this issue open until we implement a fix, though I can't provide you a timeline, since being the only user reporting the error it will be low on the priority list.

glenn-jocher commented 4 years ago

@bigrobinson I think this issue should be resolved now, as we've been training a custom dataset with many (about a thousand) negative example images tagged onto the end, and like you said test.py does not randomly shuffle, and we get no errors.

Though we haven't actually looked into the problem, we simply don't see it on our negative datasets. In our example our negative example images don't have label files.