Hi
Has anyone managed to train on multiple GPUs? I'm using this command:
python train_3d.py --outdir=./outdir --data=shapenet_get3d/img/03790512 --camera_path shapenet_get3d/camera --gpus=8 --batch=32 --gamma=40 --data_camera_mode shapenet_motorbike --dmtet_scale 1.0 --use_shapenet_split 1 --one_3d_generator 0 --img_res=256 --kimg=200 --workers 1
It fails while constructing the networks, with the following output:
Constructing networks...
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:158, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
Setting up augmentation...
Distributing across 8 GPUs...
Traceback (most recent call last):
File "train_3d.py", line 339, in <module>
main() # pylint: disable=no-value-for-parameter
File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "train_3d.py", line 333, in main
launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
File "train_3d.py", line 107, in launch_training
torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "~/GET3D/train_3d.py", line 51, in subprocess_fn
training_loop_3d.training_loop(rank=rank, **c)
File "~/GET3D/training/training_loop_3d.py", line 159, in training_loop
G = dnnlib.util.construct_class_by_name(**G_kwargs, **common_kwargs).train().requires_grad_(False).to(
File "~/GET3D/dnnlib/util.py", line 306, in construct_class_by_name
return call_func_by_name(*args, func_name=class_name, **kwargs)
File "~/GET3D/dnnlib/util.py", line 301, in call_func_by_name
return func_obj(*args, **kwargs)
File "~/GET3D/torch_utils/persistence.py", line 105, in __init__
super().__init__(*args, **kwargs)
File "~/GET3D/training/networks_get3d.py", line 599, in __init__
self.synthesis = DMTETSynthesisNetwork(
File "~/GET3D/torch_utils/persistence.py", line 105, in __init__
super().__init__(*args, **kwargs)
File "~/GET3D/training/networks_get3d.py", line 81, in __init__
self.dmtet_geometry = DMTetGeometry(
File "~/GET3D/uni_rep/rep_3d/dmtet.py", line 423, in __init__
all_edges_sorted = torch.sort(all_edges, dim=1)[0]
RuntimeError: CUDA error: an illegal memory access was encountered
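In case it helps anyone narrow this down, here is a minimal sketch of how the same command could be rerun with more diagnostics. It assumes the standard `CUDA_LAUNCH_BLOCKING` and `NCCL_DEBUG` environment variables are honoured by this PyTorch/NCCL build. The helper names `debug_env` and `with_gpus` are hypothetical and not part of GET3D. I have not confirmed that either step changes the outcome. The idea is only to get a more precise error location and to check whether a single-GPU run hits the same illegal memory access.

```python
import os
import shlex

# The training command from this post, unchanged.
TRAIN_CMD = (
    "python train_3d.py --outdir=./outdir --data=shapenet_get3d/img/03790512 "
    "--camera_path shapenet_get3d/camera --gpus=8 --batch=32 --gamma=40 "
    "--data_camera_mode shapenet_motorbike --dmtet_scale 1.0 "
    "--use_shapenet_split 1 --one_3d_generator 0 --img_res=256 --kimg=200 "
    "--workers 1"
)


def debug_env(base=None):
    """Return a copy of the environment with extra diagnostics switched on.

    Hypothetical helper: CUDA_LAUNCH_BLOCKING=1 makes kernel launches
    synchronous so the reported failing line is more trustworthy, and
    NCCL_DEBUG=INFO makes NCCL log its setup in more detail.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    env["NCCL_DEBUG"] = "INFO"
    return env


def with_gpus(cmd, n):
    """Split cmd into an argv list, replacing the --gpus=N value with n."""
    return [
        f"--gpus={n}" if arg.startswith("--gpus=") else arg
        for arg in shlex.split(cmd)
    ]
```

The result of `with_gpus(TRAIN_CMD, 1)` together with `env=debug_env()` could then be passed to `subprocess.run` from the GET3D checkout, first with 1 GPU and then with 8, to compare the two logs.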