open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Using this repo with the BDD100k seg_track_20 dataset #5261

Closed: pcicales closed this issue 3 years ago

pcicales commented 3 years ago

Hello,

I constructed my config file as follows, using the guide provided; I use absolute paths since my data is not located in the project directory:

# The new config inherits a base config to highlight the necessary modification
_base_ = '/red/workspace/pycharm_projects/mmdetection/configs/mask_rcnn/mask_rcnn_r50_caffe_fpn_mstrain-poly_1x_coco.py'

# We also need to change the num_classes in head to match the dataset's annotation
model = dict(
    roi_head=dict(
        bbox_head=dict(num_classes=8),
        mask_head=dict(num_classes=8)))

# Modify dataset related settings
dataset_type = 'COCODataset'
classes = ('pedestrian', 'rider', 'car', 'truck', 'bus', 'train', 'motorcycle', 'bicycle', )

data = dict(
    train=dict(
        img_prefix='/red/datasets/bdd100k/images/seg_track_20/train/',
        classes=classes,
        ann_file='/red/datasets/bdd100k/labels/seg_track_20_coco/seg_track_20_train.json'),
    val=dict(
        img_prefix='/red/datasets/bdd100k/images/seg_track_20/val/',
        classes=classes,
        ann_file='/red/datasets/bdd100k/labels/seg_track_20_coco/seg_track_20_val.json'),
    test=dict(
        img_prefix='/red/datasets/bdd100k/images/seg_track_20/val/',
        classes=classes,
        ann_file='/red/datasets/bdd100k/labels/seg_track_20_coco/seg_track_20_val.json'))

# To start from a pretrained model, you can set load_from to the desired checkpoint
# load_from = 'checkpoints/mask_rcnn_r50_caffe_fpn_mstrain-poly_3x_coco_bbox_mAP-0.408__segm_mAP-0.37_20200504_163245-42aa3d00.pth'

Keep in mind that I converted the BDD100k annotations to COCO format using the scripts provided at https://github.com/bdd100k/bdd100k.

The above code yields the following traceback:

Traceback (most recent call last):
 File "tools/train.py", line 188, in <module>
   main()
 File "tools/train.py", line 184, in main
   meta=meta)
 File "/red/workspace/pycharm_projects/mmdetection/mmdet/apis/train.py", line 170, in train_detector
   runner.run(data_loaders, cfg.workflow)
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
   epoch_runner(data_loaders[i], **kwargs)
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 47, in train
   for i, data_batch in enumerate(self.data_loader):
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
   data = self._next_data()
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
   return self._process_data(data)
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
   data.reraise()
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
   raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
   data = fetcher.fetch(index)
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
   data = [self.dataset[idx] for idx in possibly_batched_index]
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
   data = [self.dataset[idx] for idx in possibly_batched_index]
 File "/red/workspace/pycharm_projects/mmdetection/mmdet/datasets/custom.py", line 194, in __getitem__
   data = self.prepare_train_img(idx)
 File "/red/workspace/pycharm_projects/mmdetection/mmdet/datasets/custom.py", line 217, in prepare_train_img
   return self.pipeline(results)
 File "/red/workspace/pycharm_projects/mmdetection/mmdet/datasets/pipelines/compose.py", line 40, in __call__
   data = t(data)
 File "/red/workspace/pycharm_projects/mmdetection/mmdet/datasets/pipelines/loading.py", line 371, in __call__
   results = self._load_masks(results)
 File "/red/workspace/pycharm_projects/mmdetection/mmdet/datasets/pipelines/loading.py", line 326, in _load_masks
   [self.process_polygons(polygons) for polygons in gt_masks], h,
 File "/red/workspace/pycharm_projects/mmdetection/mmdet/datasets/pipelines/loading.py", line 326, in <listcomp>
   [self.process_polygons(polygons) for polygons in gt_masks], h,
 File "/red/workspace/pycharm_projects/mmdetection/mmdet/datasets/pipelines/loading.py", line 303, in process_polygons
   if len(polygon) % 2 == 0 and len(polygon) >= 6:
TypeError: len() of unsized object

Any advice would be appreciated; I am going through the traceback now to understand what happened.

pcicales commented 3 years ago

I was able to resolve the above issue by adding the following lines to my config file (to deal with the RLE mask format from bdd100k):

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True, poly2mask=True)
]

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True, poly2mask=True)
]
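
(For reference, if an override like this does not seem to take effect, it usually also has to be wired into the data dict. A sketch of a fuller version follows; the extra transforms and the img_norm_cfg values are assumptions copied from the base caffe Mask R-CNN config, so adjust them to your setup.)

img_norm_cfg = dict(
    mean=[103.530, 116.280, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True, poly2mask=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
# test_pipeline would be wired in the same way via val=dict(pipeline=...) / test=dict(pipeline=...)
data = dict(train=dict(pipeline=train_pipeline))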

However, I am now getting the following error:

RecursionError: maximum recursion depth exceeded

I am not sure what to make of that; I could simply increase the recursion depth, but is this a common problem? What do you think @hhaAndroid ?

pcicales commented 3 years ago

Sorry, I did not mean to close it.

pcicales commented 3 years ago

I fixed the above problem by adding the following to the train.py file:

import sys
sys.setrecursionlimit(20000)

However, I am now getting a segmentation fault. I went through the segfault troubleshooting steps (GCC versions, installation checks, etc.) and everything checks out. Here is the traceback:

Traceback (most recent call last):
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
   data = self._data_queue.get(timeout=timeout)
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/multiprocessing/queues.py", line 104, in get
   if not self._poll(timeout):
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/multiprocessing/connection.py", line 257, in poll
   return self._poll(timeout)
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
   r = wait([self], timeout)
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/multiprocessing/connection.py", line 921, in wait
   ready = selector.select(timeout)
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/selectors.py", line 415, in select
   fd_event_list = self._selector.poll(timeout)
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
   _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 197954) is killed by signal: Segmentation fault.
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
 File "tools/train.py", line 190, in <module>
   main()
 File "tools/train.py", line 186, in main
   meta=meta)
 File "/red/workspace/pycharm_projects/mmdetection/mmdet/apis/train.py", line 170, in train_detector
   runner.run(data_loaders, cfg.workflow)
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
   epoch_runner(data_loaders[i], **kwargs)
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 47, in train
   for i, data_batch in enumerate(self.data_loader):
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
   data = self._next_data()
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
   idx, data = self._get_data()
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1034, in _get_data
   success, data = self._try_get_data()
 File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 885, in _try_get_data
   raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 197954) exited unexpectedly

I am trying to understand what went wrong, but I am at a bit of a loss.
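
In the meantime, one thing I may try is loading the data in the main process so a bad sample raises a normal traceback instead of killing a worker; a hypothetical config override for that (the values are just examples):

# Hypothetical debugging override: workers_per_gpu=0 runs data loading in the
# main process, so a faulty sample raises directly rather than killing a worker.
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=0)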

hhaAndroid commented 3 years ago

It seems that there is a problem with the data conversion. Can you first use the browse_dataset script to visualize the dataset and check whether all of the annotations are correct?

pcicales commented 3 years ago

I am running it now, but it is quite slow since there are about 30,000 tasks. So far it has successfully gone through ~100 tasks; from what I can tell in the code, this means those ~100 tasks had clean COCO-style annotations. Is that correct? Perhaps a small number of images in my dataset have faulty annotations, which is triggering the error?

pcicales commented 3 years ago

Hi @hhaAndroid, digging deeper into the issue, I am now looking at the .json files. It seems there are extra elements and a single element missing: my files contain all of the correct elements listed here, except that supercategory is missing under categories. I also have extra elements, namely the videos group, along with instance_id, scalabel_id (specific to bdd100k), and ignore under annotations. My first question is: could this be causing the issue? My second question is: if so, why, since the other elements (aside from supercategory) are all present and correctly assigned; wouldn't extra elements just be ignored?

pcicales commented 3 years ago

@hhaAndroid I completed running your code and there were no errors. I'm going line by line to see what is going on. Do you have any other advice? I also edited the json files to be identical to what is described in the readme.

hhaAndroid commented 3 years ago

> @hhaAndroid I completed running your code and there were no errors. I'm going line by line to see what is going on. Do you have any other advice? I also edited the json files to be identical to what is described in the readme.

The most important thing is to find the problematic images and annotations; you can try that.
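
If the visualization does not surface it, a quick pass over the converted json may also help; a rough sketch (the ann_file path is just an example taken from your config):

import json

# Rough sketch: flag annotations whose segmentation is neither an RLE dict
# nor a list of polygons where each polygon has an even length >= 6.
ann_file = '/red/datasets/bdd100k/labels/seg_track_20_coco/seg_track_20_train.json'
with open(ann_file) as f:
    coco = json.load(f)

for ann in coco['annotations']:
    seg = ann.get('segmentation')
    if isinstance(seg, dict):  # RLE, handled by poly2mask=True
        continue
    if not isinstance(seg, list) or not all(
            isinstance(p, list) and len(p) % 2 == 0 and len(p) >= 6 for p in seg):
        print('suspicious annotation', ann['id'], 'image', ann['image_id'])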

pcicales commented 3 years ago

I took some time to go through this issue very carefully; it appears that this problem is related to the data loader deadlock issue that has persisted through the last several releases of PyTorch. I believe that your group has encountered this issue before, as I see several lines in your code that attempt to reduce its frequency (e.g. line 63 of epoch_based_runner, time.sleep(2)  # Prevent possible deadlock). There may be a more direct solution to this problem, but it seems like this is something that will need to be addressed by the PyTorch dev team.

To solve this problem on my end, I simply run the dist_train.sh script from the repo across multiple GPUs; I am not entirely sure why this fixes it, so perhaps someone could elaborate. Additionally, my training would have exceeded the available GPU memory on a single GPU, so perhaps that is related (although I would expect a different traceback in that case; also, the seg fault happens before data/annotations are transferred to GPU memory, based on what I see in nvidia-smi).

Thank you for your help @hhaAndroid; if your group is interested in looking into this further, there is an extensive thread discussing this issue here that has been updated for about 3 years, and I believe that is a good place to start. Additionally, increasing the sleep time in lines like line 63 of epoch_based_runner (time.sleep(2)  # Prevent possible deadlock) may help ameliorate the issue in cases like mine.
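
(Concretely, the hypothetical tweak I have in mind in mmcv/runner/epoch_based_runner.py is something along these lines; not a verified fix:)

# Hypothetical: lengthen the existing pause between epochs in EpochBasedRunner
time.sleep(5)  # Prevent possible deadlock during epoch transition (was 2)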

hhaAndroid commented 3 years ago

@pcicales Thank you