Closed MapleToFu24 closed 1 month ago
Has anyone met this error before? I've been trying to fix it for half a day.
Can anyone tell me how to fix it?
Initializing -> world_size: 3 rank: 2 in_node_size: 3 in_node_rank: 2
Initializing -> world_size: 3 rank: 1 in_node_size: 3 in_node_rank: 1
Initializing -> world_size: 3 rank: 0 in_node_size: 3 in_node_rank: 0
Output folder: output/ [27/09 11:50:56]
Loading cameras from disk... [27/09 11:50:57]
100%|███████████████████████████████████████| 261/261 [00:00<00:00, 3700.51it/s]
[NOTE]: Preloading dataset(4.525125345GB) to GPU. Disable local_sampling and distributed_dataset_storage.
[NOTE]: Preloading dataset(4.525125345GB) to GPU. Disable local_sampling and distributed_dataset_storage. [27/09 11:50:57]
Decoding Training Cameras [27/09 11:50:57] [27/09 11:50:57]
[NOTE]: Preloading dataset(4.525125345GB) to GPU. Disable local_sampling and distributed_dataset_storage. [27/09 11:50:57]
100%|█████████████████████████████████████████| 261/261 [00:12<00:00, 21.48it/s]
100%|████████████████████████████████████████| 261/261 [00:01<00:00, 211.36it/s]
Number of points before initialization : 61199 [27/09 11:51:12]
Training progress:  23%|▋         | 7000/30000 [05:49<18:08, 21.13it/s, Loss=0.0614306]
[ITER 6997] Start Testing [27/09 11:57:02]
[rank1]: Traceback (most recent call last):
[rank1]:   File "train.py", line 82, in <module>
[rank1]:     train_internal.training(
[rank1]:   File "/home/super/Grendel-GS/train_internal.py", line 242, in training
[rank1]:     training_report(
[rank1]:   File "/home/super/Grendel-GS/train_internal.py", line 370, in training_report
[rank1]:     {"name": "test", "cameras": scene.getTestCameras(), "num_cameras": len(scene.getTestCameras())},
[rank1]: TypeError: object of type 'NoneType' has no len()
[rank0]: Traceback (most recent call last):
[rank0]:   File "train.py", line 82, in <module>
[rank0]:     train_internal.training(
[rank0]:   File "/home/super/Grendel-GS/train_internal.py", line 242, in training
[rank0]:     training_report(
[rank0]:   File "/home/super/Grendel-GS/train_internal.py", line 370, in training_report
[rank0]:     {"name": "test", "cameras": scene.getTestCameras(), "num_cameras": len(scene.getTestCameras())},
[rank0]: TypeError: object of type 'NoneType' has no len()
[rank2]: Traceback (most recent call last):
[rank2]:   File "train.py", line 82, in <module>
[rank2]:     train_internal.training(
[rank2]:   File "/home/super/Grendel-GS/train_internal.py", line 242, in training
[rank2]:     training_report(
[rank2]:   File "/home/super/Grendel-GS/train_internal.py", line 370, in training_report
[rank2]:     {"name": "test", "cameras": scene.getTestCameras(), "num_cameras": len(scene.getTestCameras())},
[rank2]: TypeError: object of type 'NoneType' has no len()
Training progress:  23%|▋         | 7000/30000 [05:49<19:09, 20.01it/s, Loss=0.0614306]
W0927 11:57:03.402784 140656837973056 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 733292 closing signal SIGTERM
W0927 11:57:03.403337 140656837973056 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 733294 closing signal SIGTERM
E0927 11:57:03.582679 140656837973056 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 733293) of binary: /home/super/miniconda3/envs/G-GS/bin/python3.8
Traceback (most recent call last):
  File "/home/super/miniconda3/envs/G-GS/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/super/miniconda3/envs/G-GS/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/super/miniconda3/envs/G-GS/lib/python3.8/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/super/miniconda3/envs/G-GS/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/super/miniconda3/envs/G-GS/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/super/miniconda3/envs/G-GS/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-27_11:57:03
  host      : localhost
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 733293)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Hi, have you enabled --eval?
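For context, the tracebacks point at train_internal.py line 370, where len() is called on scene.getTestCameras(); when the scene is loaded without --eval there is no held-out test split, so that call returns None and len(None) raises the TypeError on every rank. Below is a minimal sketch (the build_validation_configs helper is hypothetical, not code from the repo) of how that call site could be guarded against a missing test split:

```python
# Minimal sketch, not Grendel-GS code: illustrates why the TypeError occurs
# and how the dict built in training_report() could tolerate a missing test split.
# Assumption: without --eval, scene.getTestCameras() returns None.

def build_validation_configs(scene):
    # Hypothetical helper mirroring the dict built at train_internal.py:370.
    test_cameras = scene.getTestCameras() or []  # guard: treat None as "no test cameras"
    return [
        {
            "name": "test",
            "cameras": test_cameras,
            "num_cameras": len(test_cameras),  # safe: len([]) == 0 instead of len(None)
        }
    ]
```

In practice, though, the straightforward fix is simply to pass --eval to train.py in your torchrun command so a test split is actually held out and getTestCameras() returns real cameras.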
Oh, it's my mistake, I forgot --eval; I had just followed the original GS workflow. Grendel-GS is working just as well now. Thanks!