threestudio-project / threestudio

A unified framework for 3D content generation.
Apache License 2.0

Training failure with Stable-Zero123 #482

Open Naxdy opened 5 months ago

Naxdy commented 5 months ago

I'm running into an issue when training with Stable Zero123. During the first epoch, I get a lot of these warnings:

[WARNING] Empty rays_indices!

Afterwards, training fails with:

[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/developer/3d/threestudio/launch.py", line 301, in <module>
[rank2]:     main(args, extras)
[rank2]:   File "/home/developer/3d/threestudio/launch.py", line 244, in main
[rank2]:     trainer.fit(system, datamodule=dm, ckpt_path=cfg.resume)
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
[rank2]:     call._call_and_handle_interrupt(
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank2]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank2]:     return function(*args, **kwargs)
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
[rank2]:     self._run(model, ckpt_path=ckpt_path)
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
[rank2]:     results = self._run_stage()
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
[rank2]:     self.fit_loop.run()
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
[rank2]:     self.advance()
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
[rank2]:     self.epoch_loop.run(self._data_fetcher)
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 141, in run
[rank2]:     self.on_advance_end(data_fetcher)
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 295, in on_advance_end
[rank2]:     self.val_loop.run()
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
[rank2]:     return loop_run(self, *args, **kwargs)
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 142, in run
[rank2]:     return self.on_run_end()
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 254, in on_run_end
[rank2]:     self._on_evaluation_epoch_end()
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 334, in _on_evaluation_epoch_end
[rank2]:     call._call_lightning_module_hook(trainer, hook_name)
[rank2]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 159, in _call_lightning_module_hook
[rank2]:     output = fn(*args, **kwargs)
[rank2]:   File "/home/developer/3d/threestudio/threestudio/systems/zero123.py", line 319, in on_validation_epoch_end
[rank2]:     shutil.rmtree(
[rank2]:   File "/nix/store/x23qh86aki8gsc583yp0fjp69j2d43nv-python3-3.10.14/lib/python3.10/shutil.py", line 715, in rmtree
[rank2]:     onerror(os.lstat, path, sys.exc_info())
[rank2]:   File "/nix/store/x23qh86aki8gsc583yp0fjp69j2d43nv-python3-3.10.14/lib/python3.10/shutil.py", line 713, in rmtree
[rank2]:     orig_st = os.lstat(path)
[rank2]: FileNotFoundError: [Errno 2] No such file or directory: 'outputs/zero123-sai/[64, 128, 256]_hamburger_rgba.png/save/it100-val'
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/developer/3d/threestudio/launch.py", line 301, in <module>
[rank3]:     main(args, extras)
[rank3]:   File "/home/developer/3d/threestudio/launch.py", line 244, in main
[rank3]:     trainer.fit(system, datamodule=dm, ckpt_path=cfg.resume)
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
[rank3]:     call._call_and_handle_interrupt(
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank3]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank3]:     return function(*args, **kwargs)
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
[rank3]:     self._run(model, ckpt_path=ckpt_path)
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
[rank3]:     results = self._run_stage()
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
[rank3]:     self.fit_loop.run()
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
[rank3]:     self.advance()
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
[rank3]:     self.epoch_loop.run(self._data_fetcher)
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 141, in run
[rank3]:     self.on_advance_end(data_fetcher)
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 295, in on_advance_end
[rank3]:     self.val_loop.run()
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
[rank3]:     return loop_run(self, *args, **kwargs)
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 142, in run
[rank3]:     return self.on_run_end()
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 254, in on_run_end
[rank3]:     self._on_evaluation_epoch_end()
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 334, in _on_evaluation_epoch_end
[rank3]:     call._call_lightning_module_hook(trainer, hook_name)
[rank3]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 159, in _call_lightning_module_hook
[rank3]:     output = fn(*args, **kwargs)
[rank3]:   File "/home/developer/3d/threestudio/threestudio/systems/zero123.py", line 319, in on_validation_epoch_end
[rank3]:     shutil.rmtree(
[rank3]:   File "/nix/store/x23qh86aki8gsc583yp0fjp69j2d43nv-python3-3.10.14/lib/python3.10/shutil.py", line 715, in rmtree
[rank3]:     onerror(os.lstat, path, sys.exc_info())
[rank3]:   File "/nix/store/x23qh86aki8gsc583yp0fjp69j2d43nv-python3-3.10.14/lib/python3.10/shutil.py", line 713, in rmtree
[rank3]:     orig_st = os.lstat(path)
[rank3]: FileNotFoundError: [Errno 2] No such file or directory: 'outputs/zero123-sai/[64, 128, 256]_hamburger_rgba.png/save/it100-val'
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/developer/3d/threestudio/launch.py", line 301, in <module>
[rank1]:     main(args, extras)
[rank1]:   File "/home/developer/3d/threestudio/launch.py", line 244, in main
[rank1]:     trainer.fit(system, datamodule=dm, ckpt_path=cfg.resume)
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
[rank1]:     call._call_and_handle_interrupt(
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank1]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]:     return function(*args, **kwargs)
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
[rank1]:     self._run(model, ckpt_path=ckpt_path)
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
[rank1]:     results = self._run_stage()
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
[rank1]:     self.fit_loop.run()
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
[rank1]:     self.advance()
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
[rank1]:     self.epoch_loop.run(self._data_fetcher)
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 141, in run
[rank1]:     self.on_advance_end(data_fetcher)
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 295, in on_advance_end
[rank1]:     self.val_loop.run()
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
[rank1]:     return loop_run(self, *args, **kwargs)
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 142, in run
[rank1]:     return self.on_run_end()
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 254, in on_run_end
[rank1]:     self._on_evaluation_epoch_end()
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 334, in _on_evaluation_epoch_end
[rank1]:     call._call_lightning_module_hook(trainer, hook_name)
[rank1]:   File "/home/developer/3d/threestudio/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 159, in _call_lightning_module_hook
[rank1]:     output = fn(*args, **kwargs)
[rank1]:   File "/home/developer/3d/threestudio/threestudio/systems/zero123.py", line 319, in on_validation_epoch_end
[rank1]:     shutil.rmtree(
[rank1]:   File "/nix/store/x23qh86aki8gsc583yp0fjp69j2d43nv-python3-3.10.14/lib/python3.10/shutil.py", line 715, in rmtree
[rank1]:     onerror(os.lstat, path, sys.exc_info())
[rank1]:   File "/nix/store/x23qh86aki8gsc583yp0fjp69j2d43nv-python3-3.10.14/lib/python3.10/shutil.py", line 713, in rmtree
[rank1]:     orig_st = os.lstat(path)
[rank1]: FileNotFoundError: [Errno 2] No such file or directory: 'outputs/zero123-sai/[64, 128, 256]_hamburger_rgba.png/save/it100-val'
[rank: 1] Child process with PID 932330 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
[1]    932241 killed     CUDA_VISIBLE_DEVICES=0,1,2,3 python launch.py --config  --train
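
For reference, every rank's traceback ends in the same place: `shutil.rmtree` raises `FileNotFoundError` because the `it100-val` directory does not exist when `on_validation_epoch_end` tries to delete it. The log alone does not show whether the directory was never created or was already removed by another process. The following is a minimal sketch of a cleanup call that tolerates a missing directory, not the project's actual code; the helper name `remove_dir_if_present` is hypothetical and does not exist in threestudio.

```python
import shutil
import tempfile
from pathlib import Path


def remove_dir_if_present(path):
    """Delete a directory tree and report whether anything was removed.

    Hypothetical helper for illustration; not part of threestudio.
    """
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        # The directory was never created, or something else removed it first.
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Mirror the shape of the path in the traceback (names are illustrative).
        target = Path(root) / "save" / "it100-val"
        target.mkdir(parents=True)
        (target / "0.png").write_bytes(b"")

        print(remove_dir_if_present(target))  # first call deletes the tree
        print(remove_dir_if_present(target))  # second call finds nothing to delete
```

A guard like this only hides the symptom. If the validation images are never written in the first place, that would still need to be tracked down separately.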

I've amended the config as follows:

     height: [64, 128, 256]
     width: [64, 128, 256]
-    batch_size: [12, 8, 4]
+    batch_size: [3, 2, 1]
     resolution_milestones: [200, 300]
     eval_height: 512
     eval_width: 512
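
To make the intent of the amended values concrete, here is a small sketch of how per-stage lists such as `height`, `batch_size` and `resolution_milestones` are typically combined. It assumes that a milestone `m` means "switch to the next entry at step `m`"; the function name `stage_index` is hypothetical, and the exact boundary behaviour inside threestudio is not confirmed here.

```python
import bisect


def stage_index(global_step, resolution_milestones):
    """Return the index of the resolution stage active at global_step.

    Assumption: stage i + 1 starts at step resolution_milestones[i].
    """
    return bisect.bisect_right(resolution_milestones, global_step)


# Values taken from the amended config above.
heights = [64, 128, 256]
batch_sizes = [3, 2, 1]
milestones = [200, 300]

for step in (0, 100, 199, 200, 299, 300):
    i = stage_index(step, milestones)
    print(f"step {step}: {heights[i]}x{heights[i]}, batch_size {batch_sizes[i]}")
```

Under that assumption, the `it100-val` path in the traceback would fall in the first stage (64x64, batch size 3), before any resolution switch.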

I'm training on 4 GPUs, but the same happens when attempting to train on a single GPU. Any pointers as to what might be causing this?