threestudio-project / threestudio

A unified framework for 3D content generation.
Apache License 2.0

Pickle Error in Distributed mode #425

Open vishalghor opened 7 months ago

vishalghor commented 7 months ago

While running the multi-GPU examples by following the README steps, I get this error:

Traceback (most recent call last):
  File "/home/user/projects/threestudio/launch.py", line 304, in <module>
    main(args, extras)
  File "/home/user/projects/threestudio/launch.py", line 247, in main
    trainer.fit(system, datamodule=dm, ckpt_path=cfg.resume)
  File "/home/vghorpad/.conda/envs/training/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/user/.conda/envs/training/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/user/.conda/envs/training/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/home/user/.conda/envs/training/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    process.start()
  File "/home/user/.conda/envs/training/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/user/.conda/envs/training/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/home/user/.conda/envs/training/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/user/.conda/envs/training/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/user/.conda/envs/training/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/user/.conda/envs/training/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'get_activation.<locals>.<lambda>'
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

I looked up suggestions for a similar error at https://discuss.huggingface.co/t/cant-pickle-error-using-accelerate-multi-gpu/32358 and found that the error arises from https://github.com/threestudio-project/threestudio/blob/main/launch.py#L169:

  system: BaseSystem = threestudio.find(cfg.system_type)(
      cfg.system, resumed=cfg.resume is not None
  )
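To illustrate what the last line of the traceback is complaining about, here is a minimal sketch that reproduces the same kind of pickling failure outside threestudio. The traceback shows the spawn-based launcher pickling the process object, and pickle cannot serialize a lambda created inside a function. The names below (get_activation_local, get_activation_picklable, _identity, can_pickle) are hypothetical stand-ins and not the actual threestudio code; I am only assuming that the real get_activation returns a lambda for some activation names, as the error message suggests.

```python
import pickle


def get_activation_local(name):
    # Hypothetical stand-in for a helper that returns a lambda defined
    # inside the function body. Its qualified name contains "<locals>",
    # so pickle cannot look it up by name.
    if name == "none":
        return lambda x: x
    raise ValueError(f"unknown activation: {name}")


def _identity(x):
    # Module-level function: pickle stores it by reference (module + name).
    return x


# Assumed registry for illustration only.
_ACTIVATIONS = {"none": _identity}


def get_activation_picklable(name):
    # Same lookup, but returns a module-level function instead of a lambda.
    try:
        return _ACTIVATIONS[name]
    except KeyError:
        raise ValueError(f"unknown activation: {name}") from None


def can_pickle(obj):
    # Returns True if pickle.dumps succeeds. The exception type raised for
    # local objects differs between Python versions, so several are caught.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


local_ok = can_pickle(get_activation_local("none"))
module_ok = can_pickle(get_activation_picklable("none"))
print("local lambda picklable:", local_ok)
print("module-level function picklable:", module_ok)
```

If this reasoning holds, any object that keeps a reference to such a lambda (for example a system holding an activation attribute) would fail the same way when the spawn launcher tries to pickle it, while a module-level function would not.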

This is not a machine or driver issue, as I confirmed the following: