isl-org / Open3D-ML

An extension of Open3D to address 3D Machine Learning tasks

Can't pickle local object while training RandLANet on S3DIS #478

Open kimdn opened 2 years ago

kimdn commented 2 years ago

Checklist

Describe the issue

Can't pickle local object while training RandLANet on S3DIS. I am using PyTorch.

Steps to reproduce the bug

import open3d.ml as _ml3d
import open3d.ml.torch as ml3d

model = ml3d.models.RandLANet()

dataset_path = "/Users/kimd999/research/projects/Danny/files/public_dataset/S3DIS/Stanford3dDataset_v1.2_Aligned_Version"
dataset = ml3d.datasets.S3DIS(dataset_path=dataset_path, use_cache=True)

pipeline = ml3d.pipelines.SemanticSegmentation(model=model, dataset=dataset, max_epoch=100)

# prints training progress in the console.
pipeline.run_train()

Error message

INFO - 2022-02-15 13:24:10,927 - semantic_segmentation - DEVICE : cpu
INFO - 2022-02-15 13:24:10,927 - semantic_segmentation - Logging in file : ./logs/RandLANet_S3DIS_torch/log_train_2022-02-15_13:24:10.txt
INFO - 2022-02-15 13:24:10,929 - s3dis - Found 249 pointclouds for train
INFO - 2022-02-15 13:24:10,935 - s3dis - Found 23 pointclouds for validation
INFO - 2022-02-15 13:24:10,937 - semantic_segmentation - Initializing from scratch.
INFO - 2022-02-15 13:24:10,940 - semantic_segmentation - Writing summary in train_log/00003_RandLANet_S3DIS_torch.
INFO - 2022-02-15 13:24:10,940 - semantic_segmentation - Started training
INFO - 2022-02-15 13:24:10,940 - semantic_segmentation - === EPOCH 0/100 ===
training: 0%| | 0/63 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train_model_for_semantic_segmentation.py", line 19, in <module>
    pipeline.run_train()
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/site-packages/open3d/_ml3d/torch/pipelines/semantic_segmentation.py", line 394, in run_train
    for step, inputs in enumerate(tqdm(train_loader, desc='training')):
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 355, in __iter__
    return self._get_iterator()
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 301, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 914, in __init__
    w.start()
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/kimd999/bin/miniconda3/envs/open3d/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'SemSegRandomSampler.get_point_sampler.<locals>._random_centered_gen'

Expected behavior

No response

Open3D, Python and System information

- Operating system: OSX 10.15.7
- Python version: 3.8.12
- Open3D version: 0.14.1
- System type: x86_64
- Is this a remote workstation?: no
- How did you install Open3D?: pip install open3d

Additional information

No response

bernhardpg commented 2 years ago

I am having the exact same issue, also with RandLANet and SemanticSegmentation. I will let you know if I find the problem, @kimdn.

bernhardpg commented 2 years ago

@kimdn It seems this is caused by PyTorch's DataLoader multiprocessing (num_workers); see this thread: https://github.com/pyg-team/pytorch_geometric/issues/366.

Try setting num_workers=0 in your pipeline definition, like so: pipeline = ml3d.pipelines.SemanticSegmentation(model=model, dataset=dataset, max_epoch=100, num_workers=0)

It is not a great solution if you intend to use num_workers > 0, but hopefully it at least resolves the error!
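
For context, here is a minimal standalone sketch of the underlying limitation (the function names below only mirror the error message and are not Open3D-ML code): worker start methods such as spawn and forkserver pickle the dataset and sampler for every DataLoader worker, and Python's pickle cannot serialize a function defined inside another function.

import pickle

def get_point_sampler():
    def _random_centered_gen():  # nested "local" function, like the sampler's generator
        return 42
    return _random_centered_gen

sampler_fn = get_point_sampler()

try:
    pickle.dumps(sampler_fn)
except (AttributeError, pickle.PicklingError) as err:
    # e.g. "Can't pickle local object 'get_point_sampler.<locals>._random_centered_gen'"
    print(err)

With num_workers=0 the DataLoader iterates in the main process, so nothing has to be pickled and the error disappears.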

maosuli commented 2 years ago

I used WSL Ubuntu to train the models. num_workers > 0 worked for RandLA-Net but not for KPConv, which was very strange. But at least it proved that multiprocessing can work in this virtual environment. Do you have any idea what difference in the model deployments could explain this?

whuhxb commented 1 year ago

Hi @bernhardpg @LuZaiJiaoXiaL

I have set num_workers to 0, but I still hit this bug. Do you know how to solve it?

python scripts/run_pipeline.py torch -c ml3d/configs/randlanet_toronto3d.yml --dataset.dataset_path dataset/Toronto_3D --pipeline SemanticSegmentation --dataset.use_cache True --num_workers 0

INFO - 2022-12-09 17:31:29,220 - semantic_segmentation - === EPOCH 0/200 ===
training: 0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/export/home2/hanxiaobing/Documents/Open3D-ML-code/Open3D-ML/scripts/run_pipeline.py", line 246, in <module>
    sys.exit(main())
  File "/export/home2/hanxiaobing/Documents/Open3D-ML-code/Open3D-ML/scripts/run_pipeline.py", line 180, in main
    pipeline.run_train()
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/site-packages/open3d/_ml3d/torch/pipelines/semantic_segmentation.py", line 406, in run_train
    for step, inputs in enumerate(tqdm(train_loader, desc='training')):
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 438, in __iter__
    return self._get_iterator()
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 384, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1048, in __init__
    w.start()
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/multiprocessing/context.py", line 291, in _Popen
    return Popen(process_obj)
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/multiprocessing/popen_forkserver.py", line 47, in _launch
    reduction.dump(process_obj, buf)
  File "/export/home2/hanxiaobing/anaconda3/envs/Open3D-ML-Pytorch/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'SemSegRandomSampler.get_point_sampler.<locals>._random_centered_gen'
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
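
The popen_forkserver frames in this traceback suggest another possible workaround on Linux/WSL, sketched below; it is untested with Open3D-ML and assumes the start method can still be overridden before the pipeline builds its DataLoader. The fork start method inherits the parent process's memory instead of pickling the sampler, so the nested function never needs to be serialized (note that fork can be problematic once CUDA has been initialized in the parent process).

import multiprocessing as mp

if __name__ == "__main__":
    # Force the "fork" start method before any DataLoader workers are created.
    mp.set_start_method("fork", force=True)
    # ...build the dataset, model and pipeline as usual, then:
    # pipeline.run_train()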

ted8201 commented 1 year ago

I set num_workers to 0 in run_pipeline.py, but the same error still happens. [screenshot of the same error omitted]

RauchLukas commented 1 year ago

Hey, any new insights into how to fix this problem? I just ran into the same issue on a dockerized Ubuntu 20.04 with cuDNN 11.7.

I'd be happy if someone could share their latest fixes.

runra commented 11 months ago

Same problem here. Very keen to get this fixed if I can.

Thanks

DCtcl commented 6 months ago

Hi, I found a solution. Just add "num_workers: 0" and "pin_memory: false" under the "pipeline" section of the .yml config file. Solution link: https://blog.csdn.net/weixin_40653140/article/details/130492849
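
For reference, a sketch of what that change might look like in ml3d/configs/randlanet_toronto3d.yml; everything except the two added keys is a placeholder for whatever the existing pipeline section already contains:

pipeline:
  name: SemanticSegmentation
  # ...keep the existing pipeline settings...
  num_workers: 0      # do not spawn DataLoader worker processes
  pin_memory: false   # as suggested above, together with num_workers: 0

After editing the config, rerun the same scripts/run_pipeline.py command as above; no code change should be needed.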