nv-tlabs / GET3D

torch.multiprocessing does not work for multiple GPUs #155

Open osamarais opened 8 months ago

osamarais commented 8 months ago

I can train successfully on a single GPU with a batch size of 4, but training on 4 GPUs with a batch size of 16 fails.

I get the following error message:

Lock file exists in build directory: '/gpfs/u/home/~/.cache/torch_extensions/nvdiffrast_plugin/lock'
tick 0     kimg 0.0      time 27m 55s      sec/tick 1665.6  sec/kimg 104099.05 maintenance 9.2   
==> start visualization
Traceback (most recent call last):
  File "train_3d.py", line 339, in <module>
    main()  # pylint: disable=no-value-for-parameter
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "train_3d.py", line 333, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train_3d.py", line 107, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGBUS
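
For anyone trying to reproduce this, here is a small diagnostic sketch, not a confirmed fix. It checks two things the log above hints at: whether a `lock` file is still present under the torch extensions cache (the first log line reports one), and how much space is free on the shared-memory mount, since SIGBUS in multi-process PyTorch jobs is sometimes associated with exhausted shared memory. The helper names `find_stale_locks` and `shared_memory_free_bytes` are hypothetical, and both the default cache location `~/.cache/torch_extensions` and the `/dev/shm` mount point are assumptions that may not match every cluster.

```python
import os
import shutil
from pathlib import Path


def find_stale_locks(extensions_root):
    """Return sorted paths of files named 'lock' anywhere under extensions_root.

    Hypothetical helper: it only reports lock files, it does not delete them.
    """
    root = Path(extensions_root).expanduser()
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("lock") if p.is_file())


def shared_memory_free_bytes(path="/dev/shm"):
    """Return free bytes on the shared-memory mount, or None if it is absent.

    Assumption: shared memory is mounted at /dev/shm (typical on Linux).
    """
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    # TORCH_EXTENSIONS_DIR overrides the build cache location when set;
    # the fallback below is an assumed default.
    root = os.environ.get("TORCH_EXTENSIONS_DIR", "~/.cache/torch_extensions")

    locks = find_stale_locks(root)
    if locks:
        print("Lock files found (only remove them if no build is running):")
        for lock in locks:
            print("  ", lock)
    else:
        print("No lock files found under", root)

    free = shared_memory_free_bytes()
    if free is None:
        print("/dev/shm not found on this system")
    else:
        print(f"/dev/shm free: {free / 2**30:.2f} GiB")
```

Running it on the node before launching `train_3d.py` shows whether a leftover lock could be stalling the `nvdiffrast_plugin` build and whether shared memory is tight. Neither check proves the cause of the SIGBUS.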