I was able to resolve the above issue by adding the following lines in my config file (to deal with the RLE format from bdd100k):
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True, poly2mask=True)
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True, poly2mask=True)
]
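For context, these two dicts sit inside a larger pipeline; here is a rough sketch of the full train_pipeline they belong to, using the stock Mask R-CNN values from the mmdetection 2.x configs (the Resize/Normalize/Pad settings below are the repo defaults, not something specific to my setup):

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    # poly2mask=True decodes the bdd100k RLE segmentations into bitmap masks
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True, poly2mask=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]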
However, I am now getting the following error:
RecursionError: maximum recursion depth exceeded
I am not sure what to make of that; I could simply increase the recursion limit, but is this a common problem? What do you think, @hhaAndroid?
Sorry, I did not mean to close it.
I fixed the above problem by using the following in the train.py file:
import sys
sys.setrecursionlimit(20000)  # raise the default limit (1000) to get past the RecursionError
However, I am now getting a segmentation fault. I went through the segmentation fault walkthrough (GCC versions, installation checks, etc.) and everything checks out. Here is the traceback:
Traceback (most recent call last):
File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/multiprocessing/queues.py", line 104, in get
if not self._poll(timeout):
File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/multiprocessing/connection.py", line 921, in wait
ready = selector.select(timeout)
File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 197954) is killed by signal: Segmentation fault.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "tools/train.py", line 190, in <module>
main()
File "tools/train.py", line 186, in main
meta=meta)
File "/red/workspace/pycharm_projects/mmdetection/mmdet/apis/train.py", line 170, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
epoch_runner(data_loaders[i], **kwargs)
File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 47, in train
for i, data_batch in enumerate(self.data_loader):
File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
idx, data = self._get_data()
File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1034, in _get_data
success, data = self._try_get_data()
File "/red/anaconda3/envs/AMB_BDD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 885, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 197954) exited unexpectedly
I am trying to understand what went wrong, but I am at a bit of a loss.
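One thing that can at least make the failure easier to localize (a sketch of only the relevant config keys; samples_per_gpu is just a placeholder value) is to temporarily load the data in the main process, so a crash is not hidden behind the generic worker error:

# debugging only: with 0 workers the pipeline runs in the main process,
# so a segfault points at the offending sample instead of a dead worker
data = dict(
    samples_per_gpu=2,   # placeholder, keep whatever your config uses
    workers_per_gpu=0)   # restore the original value after debugging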
It seems that there is a problem with the data conversion. Can you first use the browse_dataset script to visualize the data and check whether all of it is correct?
I am running it now, but it is quite slow since there are about 30,000 tasks. So far it has successfully gone through ~100 tasks; from what I can tell in the code, that means these ~100 tasks had clean COCO-style annotations. Is that correct? Perhaps a small number of images in my dataset have faulty annotations, and that is what is triggering the error?
Hi @hhaAndroid, going deeper into the issue, I am now looking at the .json files. It seems that there are extra elements and a single element missing: my files contain all of the correct elements listed here, but are missing only supercategory under categories. I also have extra elements, including the videos group, along with instance_id, scalabel_id (specific to bdd100k) and ignore under annotations. My first question is: could this be causing the issue? My second question would then be: why, since the other elements (aside from supercategory) are all present and correctly assigned; wouldn't extra elements just be ignored?
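For what it is worth, a quick way to sanity-check this (a sketch only; the json path is a placeholder for my actual annotation file) is to load the file with pycocotools directly and inspect one category and one annotation:

from pycocotools.coco import COCO

coco = COCO('/abs/path/to/bdd100k_train_coco.json')  # placeholder path
cat = next(iter(coco.cats.values()))
ann = next(iter(coco.anns.values()))
print('supercategory' in cat)   # the file indexes fine even if this prints False
print(sorted(ann.keys()))       # extra keys like scalabel_id are simply carried along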
@hhaAndroid I completed running your code and there were no errors. I'm going line by line to see what is going on. Do you have any other advice? I also edited the json files to be identical to what is described in the readme.
The most important step is to find out which image and annotations are wrong; you can try that.
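For example, something along these lines can narrow it down (an untested sketch; the config path is a placeholder): build the training dataset from the config and iterate it in the main process, so the last printed index identifies the failing sample.

from mmcv import Config
from mmdet.datasets import build_dataset

cfg = Config.fromfile('/abs/path/to/my_config.py')  # placeholder path
dataset = build_dataset(cfg.data.train)
for idx in range(len(dataset)):
    print(idx, flush=True)  # the last index printed before a crash is the bad sample
    _ = dataset[idx]        # runs the full train_pipeline, including poly2mask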
I took some time to go through this issue very carefully; it appears that this problem is related to the DataLoader deadlock issue that has persisted through the last several releases of PyTorch. I believe that your group has encountered this issue before, as I see several lines in your code that attempt to reduce its frequency (e.g. line 63 of epoch_based_runner, time.sleep(2) # Prevent possible deadlock). There may be a more direct solution to this problem, but it seems like this is something that will need to be addressed by the PyTorch dev team.
To solve this problem on my end, I simply run the dist_train.sh script from the repo across multiple GPUs; I am not entirely sure why this fixes it, so perhaps someone could elaborate. Additionally, my training would have exceeded the available GPU memory on a single GPU, so perhaps that is related (although I would expect to see a different traceback, and the segfault happens before the data/annotations are transferred to GPU memory, based on what I see in nvidia-smi).
Thank you for your help @hhaAndroid; if your group is interested in looking into this further, there is an extensive thread here that discusses this issue and has been active for about 3 years. I believe that is a good place to start. Additionally, increasing the sleep time in lines like line 63 of epoch_based_runner (time.sleep(2) # Prevent possible deadlock) may help ameliorate the issue in cases like mine.
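If anyone wants to experiment with a longer pause without editing the installed mmcv, a rough sketch of how it could be patched from tools/train.py (the 10-second value is arbitrary, and this adds an extra sleep on top of mmcv's own rather than changing the original line):

import time
from mmcv.runner import EpochBasedRunner

_orig_train = EpochBasedRunner.train

def _patched_train(self, data_loader, **kwargs):
    time.sleep(10)  # extra pause before each epoch, in addition to mmcv's time.sleep(2)
    return _orig_train(self, data_loader, **kwargs)

EpochBasedRunner.train = _patched_train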
@pcicales Thank you
Hello,
I constructed my config file as follows, using the guide provided; I use absolute paths since my data is not located in the project directory:
Keep in mind that I converted the BDD annotations to COCO using the scripts provided at https://github.com/bdd100k/bdd100k
The above code yields the following traceback:
Any advice would be appreciated; I am going through the trace now to understand what happened.