open-mmlab / mmdetection3d

OpenMMLab's next-generation platform for general 3D object detection.
https://mmdetection3d.readthedocs.io/en/latest/
Apache License 2.0

Errors in distributed training with NuScenes dataset #890

Closed jiaminglei-lei closed 3 years ago

jiaminglei-lei commented 3 years ago

Describe the bug When running distributed training with a customized model on the NuScenes dataset, "TypeError: can't pickle dict_keys objects" occurred.

Reproduction

  1. What command or script did you run?
CUDA_VISIBLE_DEVICES=1,2 bash ./tools/dist_train.sh {config_path} 2
  2. Did you make any modifications on the code or config? Did you understand what you have modified?
  3. What dataset did you use? - NuScenes.

Environment

  1. Please run python mmdet3d/utils/collect_env.py to collect necessary environment information and paste it here.
    
    sys.platform: linux
    Python: 3.7.10 (default, Jun  4 2021, 14:48:32) [GCC 7.5.0]
    CUDA available: True
    GPU 0,1: TITAN Xp
    CUDA_HOME: /usr/local/cuda-10.2
    NVCC: Cuda compilation tools, release 10.2, V10.2.89
    GCC: gcc (GCC) 5.2.0
    PyTorch: 1.5.1
    PyTorch compiling details: PyTorch built with:
    - GCC 7.3
    - C++ Version: 201402
    - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
    - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
    - OpenMP 201511 (a.k.a. OpenMP 4.5)
    - NNPACK is enabled
    - CPU capability usage: AVX2
    - CUDA Runtime 10.2
    - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
    - CuDNN 7.6.5
    - Magma 2.5.2
    - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

    TorchVision: 0.6.0a0+35d732a
    OpenCV: 4.5.3
    MMCV: 1.3.9
    MMCV Compiler: GCC 5.2
    MMCV CUDA Compiler: 10.2
    MMDetection: 2.14.0
    MMSegmentation: 0.14.1
    MMDetection3D: 0.15.0+


**Error traceback**

Traceback (most recent call last):
  File "./tools/train.py", line 229, in <module>
    main()
  File "./tools/train.py", line 225, in main
    meta=meta)
  File "/home/leijiaming/Code/rrpn-mmdetection3d/mmdet3d/apis/train.py", line 34, in train_model
    meta=meta)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 47, in train
    for i, data_batch in enumerate(self.data_loader):
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 719, in __init__
    w.start()
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/leijiaming/anaconda3/envs/mmdetection3d/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle dict_keys objects



It worked successfully on a single GPU, so I don't think the customized model is the cause of this problem.
I found the same error in an earlier issue, but with no solution, so I'm posting a new issue.
Tai-Wang commented 3 years ago

It seems something tried to dump a dict_keys object in the pickle format. As your traceback shows, it is likely happening when the data loader worker processes are created. Have you changed anything in the data pipeline? And when exactly does this error appear (before any training step completes successfully)?

jiaminglei-lei commented 3 years ago

@Tai-Wang Thanks for your reply. I didn't change anything in the data pipeline. I debugged and found it may be a multiprocessing problem: I changed workers_per_gpu from 2 to 0, which disables the worker processes, and then it works. But I don't know why...
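
For reference, that change is a one-line config override (a minimal sketch, assuming the standard mmdet3d `data` config layout; the samples_per_gpu value is illustrative):

```python
# workaround: load data in the main process so nothing is sent to worker processes
data = dict(
    samples_per_gpu=4,   # illustrative value
    workers_per_gpu=0,   # 0 disables DataLoader worker processes (no pickling)
)
```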

Tai-Wang commented 3 years ago

Oh, that's really strange. Can you successfully train other models with multi-gpu or multi-process?

jiaminglei-lei commented 3 years ago

@Tai-Wang Single-GPU with multi-process is OK, but multi-GPU with multi-process does not work. Maybe it's some problem related to the running system?

Tai-Wang commented 3 years ago

Then it should not be a problem with specific models or datasets. I guess it is related to your machine setup, for example, a limit on num_workers or something similar.

gujiaqivadin commented 2 years ago

@jiaminglei-lei I wonder if you have solved this problem yet, because I met the same problem in the latest version of mmdet3d.

geonhobang commented 2 years ago

@jiaminglei-lei I'm having the same problem. I wonder if you solved it.

jiaminglei-lei commented 2 years ago

@GeonHoBang @gujiaqivadin Sorry, I couldn't solve this problem. I still suppose it's something related to the running machine, because I can run it successfully on another machine.

zhouzhubin commented 2 years ago

@gujiaqivadin @GeonHoBang @jiaminglei-lei I'm having the same problem. Did you solve it? We have two servers with the same configuration here; one works, the other doesn't. It only happens with the NuScenes dataset; there is no problem with the KITTI dataset.

zhaokai5 commented 2 years ago

It can be solved by adding the following code to the training entry script:

```python
if __name__ == '__main__':
    # force the 'fork' start method so workers inherit the dataset
    # instead of receiving a pickled copy
    torch.multiprocessing.set_start_method('fork')
    main()
```
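
This presumably works because the traceback above goes through popen_spawn_posix, i.e. the workers were being started with the 'spawn' method, which pickles the dataset object for each worker; with 'fork' the workers inherit it directly and nothing needs to be pickled.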

qijiez commented 2 years ago

I would like to share a root-cause analysis here; I hope it can help others.

The error message tells us there is an object that the pickle library can't dump. Pickling is a key step when worker processes are spawned for multi-process training (here it happens during the internal data loading step). From the error log we only know that some value of type dict_keys is involved; it doesn't tell us the variable's name or where it lives. So the difficulty is finding which object can't be dumped, and replacing it with a format that pickle supports.
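
A two-line illustration of the failure mode:

```python
import pickle

d = {'car': 50, 'truck': 50}
pickle.dumps(list(d.keys()))  # fine: a plain list pickles
pickle.dumps(d.keys())        # TypeError: can't pickle dict_keys objects
```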

Debug method: use the dill library, as described in https://stackoverflow.com/questions/30499341/establishing-why-an-object-cant-be-pickled
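
dill's detect module automates this; a dependency-free sketch of the same idea is to walk the suspect object's attributes and try to pickle each one (the Demo class below is just a stand-in for the real dataset object):

```python
import pickle

def find_unpicklable(obj):
    """Return {attr_name: error} for attributes that pickle rejects."""
    bad = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as err:
            bad[name] = err
    return bad

class Demo:  # stand-in for the object the worker fails to pickle
    def __init__(self):
        self.ok = [1, 2, 3]
        self.bad = {'car': 50}.keys()  # dict_keys view: not picklable

print(find_unpicklable(Demo()))  # -> {'bad': TypeError(...)}
```

In the real case the unpicklable value sits a few levels deep (dataset -> evaluation config -> class_names), so you may need to recurse.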

With the tool, I added detection code at the point in reduction.py where the error is raised and got the results: [screenshot of the trace output]

Then we find the dict_keys value inside an object of nuscenes.eval.detection.data_classes.DetectionConfig.

Take a closer look:

# excerpt from nuscenes/eval/detection/data_classes.py (imports added for completeness)
from typing import Dict, List

from nuscenes.eval.detection.constants import DETECTION_NAMES


class DetectionConfig:
    """ Data class that specifies the detection evaluation settings. """

    def __init__(self,
                 class_range: Dict[str, int],
                 dist_fcn: str,
                 dist_ths: List[float],
                 dist_th_tp: float,
                 min_recall: float,
                 min_precision: float,
                 max_boxes_per_sample: int,
                 mean_ap_weight: int):

        assert set(class_range.keys()) == set(DETECTION_NAMES), "Class count mismatch."
        assert dist_th_tp in dist_ths, "dist_th_tp must be in set of dist_ths."

        self.class_range = class_range
        self.dist_fcn = dist_fcn
        self.dist_ths = dist_ths
        self.dist_th_tp = dist_th_tp
        self.min_recall = min_recall
        self.min_precision = min_precision
        self.max_boxes_per_sample = max_boxes_per_sample
        self.mean_ap_weight = mean_ap_weight

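        # NOTE: dict.keys() returns a dict_keys view, which pickle cannot dump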
        self.class_names = self.class_range.keys()

We find the problematic line: self.class_names = self.class_range.keys()

This code is from the official nuScenes devkit; I'm not sure whether the authors have noticed the potential problem. For us, the quick solution is to wrap it in list(): self.class_names = list(self.class_range.keys())
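
If you would rather not edit the installed package, a monkeypatch with the same effect can be applied in your own code before the dataset is built (a sketch; the wrapper names are mine):

```python
from nuscenes.eval.detection.data_classes import DetectionConfig

_orig_init = DetectionConfig.__init__

def _patched_init(self, *args, **kwargs):
    _orig_init(self, *args, **kwargs)
    # replace the unpicklable dict_keys view with a plain list
    self.class_names = list(self.class_range.keys())

DetectionConfig.__init__ = _patched_init
```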

Then, the problem is solved.