It seems you tried to dump dict objects in the pickle format. As the traceback you provided shows, this likely happens during data preprocessing. Have you changed anything in the data pipeline? And when exactly does this error appear (before successfully training even one step)?
@Tai-Wang thanks for your reply. I didn't change anything in the data pipeline. I debugged and found it may be a problem with multiprocessing. I changed `workers_per_gpu` from `2` to `0`, which disables multiprocessing, and then it works. But I don't know why.
Oh, that's really strange. Can you successfully train other models with multi-GPU or multi-process setups?
@Tai-Wang Single-GPU with multi-process is OK, but multi-GPU with multi-process does not work. Maybe it's some problem related to the running system?
Then it should not be a problem with specific models or datasets. I guess it is related to your machine setup, for example, a limit on num_workers usage or something similar.
@jiaminglei-lei I wonder if you have solved this problem yet, because I've hit the same problem in the latest version of mmdet3d.
@jiaminglei-lei I'm having the same problem. I wonder if you have solved it.
@GeonHoBang @gujiaqivadin Sorry, I couldn't solve this problem. I still suspect it's something related to the running machine, because I can run it successfully on another machine.
@gujiaqivadin @GeonHoBang @jiaminglei-lei I'm having the same problem. Have you solved it? We have two servers with the same configuration here; one works, the other doesn't. It only happens with the nuScenes dataset; there is no problem with the KITTI dataset.
It can be solved by adding the following code before training:
```python
if __name__ == '__main__':
    torch.multiprocessing.set_start_method('fork')
    main()
```
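This works because, with the `fork` start method, the DataLoader workers inherit the dataset object from the parent process instead of receiving a pickled copy, so the unpicklable `dict_keys` value never has to be serialized. The `popen_spawn_posix` frames in the traceback below show that the failing run was using the `spawn` start method, which does pickle the dataset.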
I would like to share a "root cause" solution here; I hope it can help others.
The error message shows that there is an object the pickle library can't dump, and pickling is a key step in the map-reduce pipeline of multi-process training (specifically, it happens in the internal data-loading step, when the worker processes are created). From the error log we only know that some value has type `dict_keys`; it does not show the variable's name or where it lives. The difficulty is figuring out which object can't be dumped, finding it, and replacing it with a format pickle supports.
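As a minimal standalone reproduction of this failure mode (not from the thread; `ToyDataset` is a made-up stand-in for the nuScenes dataset object, and the `spawn` start method is assumed, matching the `popen_spawn_posix` frames in the traceback):
```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Stand-in for a dataset that stores an unpicklable dict_keys view."""
    def __init__(self):
        self.class_names = {'car': 50, 'truck': 50}.keys()  # dict_keys view

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return idx

if __name__ == '__main__':
    # 'spawn' pickles the dataset when starting each worker process.
    torch.multiprocessing.set_start_method('spawn')
    loader = DataLoader(ToyDataset(), num_workers=2)
    # Raises: TypeError: can't pickle dict_keys objects
    for batch in loader:
        pass
```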
Debug method: use the `dill` library, as described in https://stackoverflow.com/questions/30499341/establishing-why-an-object-cant-be-pickled
With this tool, I added probe code at the point in `multiprocessing/reduction.py` where the error is raised and inspected the object being dumped.
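If you'd rather not patch the standard library, here is a minimal sketch of the same idea using only `pickle` (the helper name `find_unpicklable` is mine, not from the thread): it recursively tries to dump an object's members and prints the path to each one that fails.
```python
import pickle

def find_unpicklable(obj, path='obj', seen=None):
    """Recursively report members of `obj` that pickle cannot dump."""
    seen = seen if seen is not None else set()
    if id(obj) in seen:
        return
    seen.add(id(obj))
    try:
        pickle.dumps(obj)
        return  # this subtree pickles fine
    except Exception as exc:
        print(f'{path}: {type(obj).__name__} -> {exc}')
    # Descend into common containers and object attributes.
    if isinstance(obj, dict):
        for k, v in obj.items():
            find_unpicklable(v, f'{path}[{k!r}]', seen)
    elif isinstance(obj, (list, tuple, set)):
        for i, v in enumerate(obj):
            find_unpicklable(v, f'{path}[{i}]', seen)
    elif hasattr(obj, '__dict__'):
        for name, v in vars(obj).items():
            find_unpicklable(v, f'{path}.{name}', seen)
```
Running it on the dataset object passed to the DataLoader should print a path ending at the offending `dict_keys` value.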
That pointed to a `dict_keys` value inside an object of `nuscenes.eval.detection.data_classes.DetectionConfig`.
Take a closer look:
```python
class DetectionConfig:
    """ Data class that specifies the detection evaluation settings. """

    def __init__(self,
                 class_range: Dict[str, int],
                 dist_fcn: str,
                 dist_ths: List[float],
                 dist_th_tp: float,
                 min_recall: float,
                 min_precision: float,
                 max_boxes_per_sample: int,
                 mean_ap_weight: int):

        assert set(class_range.keys()) == set(DETECTION_NAMES), "Class count mismatch."
        assert dist_th_tp in dist_ths, "dist_th_tp must be in set of dist_ths."

        self.class_range = class_range
        self.dist_fcn = dist_fcn
        self.dist_ths = dist_ths
        self.dist_th_tp = dist_th_tp
        self.min_recall = min_recall
        self.min_precision = min_precision
        self.max_boxes_per_sample = max_boxes_per_sample
        self.mean_ap_weight = mean_ap_weight

        self.class_names = self.class_range.keys()
```
We find the problematic code:
```python
self.class_names = self.class_range.keys()
```
This code comes from the official nuScenes devkit; I'm not sure whether its maintainers have noticed the potential problem.
But for us, the quick fix is to wrap it in `list()`:
```python
self.class_names = list(self.class_range.keys())
```
Then, the problem is solved.
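For reference, the difference is easy to reproduce in isolation with nothing but the standard library:
```python
import pickle

d = {'car': 50, 'truck': 50}   # toy stand-in for class_range

pickle.dumps(list(d.keys()))   # fine: a plain list is picklable
pickle.dumps(d.keys())         # TypeError: can't pickle dict_keys objects
```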
Describe the bug
When running distributed training with a customized model on the nuScenes dataset, `TypeError: can't pickle dict_keys objects` occurred.
Reproduction
Environment
Output of `python mmdet3d/utils/collect_env.py`:
TorchVision: 0.6.0a0+35d732a
OpenCV: 4.5.3
MMCV: 1.3.9
MMCV Compiler: GCC 5.2
MMCV CUDA Compiler: 10.2
MMDetection: 2.14.0
MMSegmentation: 0.14.1
MMDetection3D: 0.15.0+
```
Traceback (most recent call last):
  File "./tools/train.py", line 229, in <module>
    main()
  File "./tools/train.py", line 225, in main
    meta=meta)
  File "/home/leijiaming/Code/rrpn-mmdetection3d/mmdet3d/apis/train.py", line 34, in train_model
    meta=meta)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 47, in train
    for i, data_batch in enumerate(self.data_loader):
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 719, in __init__
    w.start()
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle dict_keys objects
```