nianticlabs / mickey

[CVPR 2024 - Oral] Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences
https://nianticlabs.github.io/mickey/

Errors in multi-gpu training #13

Closed XJTU-Haolin closed 16 hours ago

XJTU-Haolin commented 2 months ago

When I ran multi-GPU training of MicKey on 4× 3090 GPUs, I got the errors below. I never see them when training on a single GPU. Something seems to be wrong with the JPEG images, but the Map-free dataset was downloaded and used without any processing.

Training with 0.00/1.00 image overlap
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
Traceback (most recent call last):
  File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/train.py", line 99, in <module>
    train_model(args)
  File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/train.py", line 89, in train_model
    trainer.fit(model, datamodule_end, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1033, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/fit_loop.py", line 201, in run
    self.on_run_start()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/fit_loop.py", line 324, in on_run_start
    self.epoch_loop.val_loop.setup_data()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 166, in setup_data
    dataloaders = _request_dataloader(source)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 342, in _request_dataloader
    return data_source.dataloader()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 309, in dataloader
    return call._call_lightning_datamodule_hook(self.instance.trainer, self.name)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 179, in _call_lightning_datamodule_hook
    return fn(*args, **kwargs)
  File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/lib/datasets/datamodules.py", line 107, in val_dataloader
    dataset = self.dataset_type(self.cfg, 'val')
  File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/lib/datasets/mapfree.py", line 190, in __init__
    data_srcs = [
  File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/lib/datasets/mapfree.py", line 191, in <listcomp>
    MapFreeScene(
  File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/lib/datasets/mapfree.py", line 26, in __init__
    self.K, self.K_ori = self.read_intrinsics(self.scene_root, resize)
  File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/lib/datasets/mapfree.py", line 47, in read_intrinsics
    K = correct_intrinsic_scale(K, resize[0] / W, resize[1] / H)
  File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/lib/datasets/utils.py", line 92, in correct_intrinsic_scale
    transform = torch.eye(3)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2655) is killed by signal: Killed.
[rank: 3] Child process with PID 27 terminated with code -9. Forcefully terminating all other processes to avoid zombies 🧟
./train.sh: line 1: 23 Killed python3 train.py
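
The "Premature end of JPEG file" lines in the log are decoder warnings about image data that stops before the end-of-image marker. One way to find out which files trigger them, whether or not they are related to the killed worker, is to scan the dataset for JPEGs that do not end with that marker. The sketch below is only a heuristic that uses the standard library; the function names are hypothetical and not part of the MicKey code base.

```python
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_truncated(path):
    """Heuristic: a complete JPEG starts with SOI and ends with EOI.

    Trailing zero padding after EOI is tolerated.
    """
    data = Path(path).read_bytes()
    return not (data.startswith(JPEG_SOI) and data.rstrip(b"\x00").endswith(JPEG_EOI))


def find_truncated_jpegs(root):
    """Return the sorted .jpg/.jpeg files under root that look truncated."""
    root = Path(root)
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file()
        and p.suffix.lower() in {".jpg", ".jpeg"}
        and looks_truncated(p)
    )
```

Calling `find_truncated_jpegs` on the dataset root (wherever the Map-free data was extracted) returns the suspicious files, which can then be re-downloaded or inspected individually.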

Could you give me any instructions?

Thanks for your time!

Haolin

axelBarroso commented 1 month ago

Hello, sorry for the late response. Is this problem solved? I couldn't replicate it.

It seems that the problem is in the validation data, not in the training data. Have you verified that the paths to the validation images and intrinsics are correct?
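
A quick way to run that check is to walk the validation split and report scenes that lack an intrinsics file or images. This is a minimal sketch under assumptions: the file name `intrinsics.txt`, the one-directory-per-scene layout, and the function name `check_scenes` are illustrative guesses, not taken from the MicKey code, so adjust them to the actual dataset layout.

```python
from pathlib import Path


def check_scenes(split_root, required_files=("intrinsics.txt",)):
    """Map each problematic scene directory to a list of issues.

    Assumes (hypothetically) one sub-directory per scene under split_root.
    Returns an empty dict when nothing looks wrong.
    """
    split_root = Path(split_root)
    if not split_root.is_dir():
        return {str(split_root): ["split directory does not exist"]}

    problems = {}
    for scene in sorted(p for p in split_root.iterdir() if p.is_dir()):
        issues = [
            f"missing {name}"
            for name in required_files
            if not (scene / name).is_file()
        ]
        # Images may live in nested sequence folders, so search recursively.
        if not any(scene.rglob("*.jpg")):
            issues.append("no .jpg images found")
        if issues:
            problems[scene.name] = issues
    return problems
```

An empty result means every scene directory has the expected files; any entries point to the scenes whose paths are worth checking first.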

XJTU-Haolin commented 3 weeks ago

> Hello, sorry for the late response. Is this problem solved? I couldn't replicate it.
>
> It seems that the problem is in the validation data, not in the training data. Have you verified that the paths to the validation images and intrinsics are correct?

I will check it again. Thanks for your reply!

axelBarroso commented 16 hours ago

Closing this issue since it has been inactive for a while. Please reopen it if you run into any other problems. Thanks!