I can successfully train on a single GPU with a batch size of 4, but I am unable to train on 4 GPUs with a batch size of 16.
I get the following output, which ends with the worker process dying on SIGBUS:
Lock file exists in build directory: '/gpfs/u/home/~/.cache/torch_extensions/nvdiffrast_plugin/lock'
tick 0 kimg 0.0 time 27m 55s sec/tick 1665.6 sec/kimg 104099.05 maintenance 9.2
==> start visualization
Traceback (most recent call last):
  File "train_3d.py", line 339, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "train_3d.py", line 333, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train_3d.py", line 107, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGBUS
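For reference, the 4-GPU run with batch size 16 should give each worker the same per-GPU batch of 4 that succeeds in the single-GPU run, so the per-GPU memory load should be comparable. A minimal sketch of that split (the helper function is mine, not from the GET3D code):

```python
def per_gpu_batch(total_batch: int, num_gpus: int) -> int:
    """Split a global batch size evenly across GPUs."""
    if total_batch % num_gpus != 0:
        raise ValueError("batch size must be divisible by the number of GPUs")
    return total_batch // num_gpus

# Failing 4-GPU run vs. working single-GPU run: same per-GPU batch.
print(per_gpu_batch(16, 4))  # 4
print(per_gpu_batch(4, 1))   # 4
```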
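The first line of the output suggests a leftover lock in the torch extension build cache for the nvdiffrast plugin, which can happen when an earlier run crashed mid-build. As a sketch of what I mean by a stale lock (this helper is hypothetical, not part of nvdiffrast or torch):

```python
import os

def clear_stale_lock(lock_path: str) -> bool:
    """Remove a leftover extension-build lock file if present.

    Returns True if a lock was found and removed, False otherwise.
    A crashed previous run can leave this file behind, which blocks
    the next build of the extension.
    """
    if os.path.exists(lock_path):
        os.remove(lock_path)
        return True
    return False
```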