nyu-systems / Grendel-GS

Ongoing research: training Gaussian splatting at scale with a distributed system

Error about training GS TypeError: object of type 'NoneType' has no len() #28

Closed · MapleToFu24 closed this issue 1 month ago

MapleToFu24 commented 1 month ago

Has anyone met this error before? I've been trying to fix it for half a day.

Can anyone tell me how to fix it?

Initializing ->  world_size: 3 rank: 2     in_node_size: 3 in_node_rank: 2
Initializing ->  world_size: 3 rank: 1     in_node_size: 3 in_node_rank: 1
Initializing ->  world_size: 3 rank: 0     in_node_size: 3 in_node_rank: 0
Output folder: output/ [27/09 11:50:56]
Loading cameras from disk... [27/09 11:50:57]
100%|███████████████████████████████████████| 261/261 [00:00<00:00, 3700.51it/s]
[NOTE]: Preloading dataset(4.525125345GB) to GPU. Disable local_sampling and distributed_dataset_storage. [27/09 11:50:57]
[NOTE]: Preloading dataset(4.525125345GB) to GPU. Disable local_sampling and distributed_dataset_storage. [27/09 11:50:57]
Decoding Training Cameras [27/09 11:50:57]
 [27/09 11:50:57]
[NOTE]: Preloading dataset(4.525125345GB) to GPU. Disable local_sampling and distributed_dataset_storage. [27/09 11:50:57]
100%|█████████████████████████████████████████| 261/261 [00:12<00:00, 21.48it/s]
100%|████████████████████████████████████████| 261/261 [00:01<00:00, 211.36it/s]
Number of points before initialization :  61199 [27/09 11:51:12]
Training progress:  23%|▋  | 7000/30000 [05:49<18:08, 21.13it/s, Loss=0.0614306]
[ITER 6997] Start Testing [27/09 11:57:02]
[rank1]: Traceback (most recent call last):
[rank1]:   File "train.py", line 82, in <module>
[rank1]:     train_internal.training(
[rank1]:   File "/home/super/Grendel-GS/train_internal.py", line 242, in training
[rank1]:     training_report(
[rank1]:   File "/home/super/Grendel-GS/train_internal.py", line 370, in training_report
[rank1]:     {"name": "test", "cameras": scene.getTestCameras(), "num_cameras": len(scene.getTestCameras())},
[rank1]: TypeError: object of type 'NoneType' has no len()
[rank0]: Traceback (most recent call last):
[rank0]:   File "train.py", line 82, in <module>
[rank0]:     train_internal.training(
[rank0]:   File "/home/super/Grendel-GS/train_internal.py", line 242, in training
[rank0]:     training_report(
[rank0]:   File "/home/super/Grendel-GS/train_internal.py", line 370, in training_report
[rank0]:     {"name": "test", "cameras": scene.getTestCameras(), "num_cameras": len(scene.getTestCameras())},
[rank0]: TypeError: object of type 'NoneType' has no len()
[rank2]: Traceback (most recent call last):
[rank2]:   File "train.py", line 82, in <module>
[rank2]:     train_internal.training(
[rank2]:   File "/home/super/Grendel-GS/train_internal.py", line 242, in training
[rank2]:     training_report(
[rank2]:   File "/home/super/Grendel-GS/train_internal.py", line 370, in training_report
[rank2]:     {"name": "test", "cameras": scene.getTestCameras(), "num_cameras": len(scene.getTestCameras())},
[rank2]: TypeError: object of type 'NoneType' has no len()
Training progress:  23%|▋  | 7000/30000 [05:49<19:09, 20.01it/s, Loss=0.0614306]
W0927 11:57:03.402784 140656837973056 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 733292 closing signal SIGTERM
W0927 11:57:03.403337 140656837973056 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 733294 closing signal SIGTERM
E0927 11:57:03.582679 140656837973056 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 733293) of binary: /home/super/miniconda3/envs/G-GS/bin/python3.8
Traceback (most recent call last):
  File "/home/super/miniconda3/envs/G-GS/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/super/miniconda3/envs/G-GS/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/super/miniconda3/envs/G-GS/lib/python3.8/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/super/miniconda3/envs/G-GS/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/super/miniconda3/envs/G-GS/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/super/miniconda3/envs/G-GS/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-27_11:57:03
  host      : localhost
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 733293)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
TarzanZhao commented 1 month ago

Hi, have you enabled --eval ?
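
For anyone hitting the same traceback: the failing line builds the "test" entry with scene.getTestCameras(), and when no test split was ever created that call returns None, so len(None) raises the TypeError at the first test iteration (7000). Below is a minimal sketch of a defensive guard, assuming the upstream 3DGS-style Scene API (getTestCameras/getTrainCameras); the helper name is illustrative and not the repository's actual code.

```python
# Sketch only: skip the "test" split in training_report when no test cameras
# exist (e.g. --eval was not passed). getTestCameras()/getTrainCameras() follow
# the upstream 3DGS Scene API; build_validation_configs is a hypothetical helper.
def build_validation_configs(scene):
    configs = []
    test_cameras = scene.getTestCameras()  # None or empty without --eval
    if test_cameras:
        configs.append(
            {"name": "test", "cameras": test_cameras, "num_cameras": len(test_cameras)}
        )
    train_cameras = scene.getTrainCameras()
    configs.append(
        {"name": "train", "cameras": train_cameras, "num_cameras": len(train_cameras)}
    )
    return configs
```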

MapleToFu24 commented 1 month ago

Hi, have you enabled --eval ?

Oh, it's my mistake, I forgot the --eval flag; I just followed the original GS command. Grendel-GS is working well now. Thanks!
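
For completeness, --eval is what creates the held-out test cameras in the first place. Here is a rough sketch of that behavior, modeled on the reference 3DGS dataset reader that Grendel-GS builds on; the llffhold default and variable names are assumptions, not the repository's exact code.

```python
# Sketch: with eval enabled, every llffhold-th camera is held out for testing;
# without it, everything goes to training and the test list stays empty, which
# is why scene.getTestCameras() had nothing to return in the traceback above.
def split_cameras(cam_infos, eval_enabled, llffhold=8):
    if eval_enabled:
        train = [c for i, c in enumerate(cam_infos) if i % llffhold != 0]
        test = [c for i, c in enumerate(cam_infos) if i % llffhold == 0]
    else:
        train = list(cam_infos)
        test = []
    return train, test
```

With --eval added to the original torchrun command, the test split exists and the iteration-7000 test report can run as expected.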