nv-tlabs / GET3D


RuntimeError: CUDA error: an illegal memory access was encountered #156

Open jpainam opened 5 months ago

jpainam commented 5 months ago

Hi. Has anyone managed to train on multiple GPUs? I'm using this command:

    python train_3d.py --outdir=./outdir --data=shapenet_get3d/img/03790512 --camera_path shapenet_get3d/camera --gpus=8 --batch=32 --gamma=40 --data_camera_mode shapenet_motorbike --dmtet_scale 1.0 --use_shapenet_split 1 --one_3d_generator 0 --img_res=256 --kimg=200 --workers 1

It fails while constructing the networks, with the following output:

Constructing networks...
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:158, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
Setting up augmentation...
Distributing across 8 GPUs...
Traceback (most recent call last):
  File "train_3d.py", line 339, in <module>
    main()  # pylint: disable=no-value-for-parameter
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "train_3d.py", line 333, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train_3d.py", line 107, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "~/miniconda3x86/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "~/GET3D/train_3d.py", line 51, in subprocess_fn
    training_loop_3d.training_loop(rank=rank, **c)
  File "~/GET3D/training/training_loop_3d.py", line 159, in training_loop
    G = dnnlib.util.construct_class_by_name(**G_kwargs, **common_kwargs).train().requires_grad_(False).to(
  File "~/GET3D/dnnlib/util.py", line 306, in construct_class_by_name
    return call_func_by_name(*args, func_name=class_name, **kwargs)
  File "~/GET3D/dnnlib/util.py", line 301, in call_func_by_name
    return func_obj(*args, **kwargs)
  File "~/GET3D/torch_utils/persistence.py", line 105, in __init__
    super().__init__(*args, **kwargs)
  File "~/GET3D/training/networks_get3d.py", line 599, in __init__
    self.synthesis = DMTETSynthesisNetwork(
  File "~/GET3D/torch_utils/persistence.py", line 105, in __init__
    super().__init__(*args, **kwargs)
  File "~/GET3D/training/networks_get3d.py", line 81, in __init__
    self.dmtet_geometry = DMTetGeometry(
  File "~/GET3D/uni_rep/rep_3d/dmtet.py", line 423, in __init__
    all_edges_sorted = torch.sort(all_edges, dim=1)[0]
RuntimeError: CUDA error: an illegal memory access was encountered
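
A debugging note on the traceback above: CUDA reports kernel errors asynchronously, so the `torch.sort` line in `dmtet.py` may not be the call that actually faulted. The sketch below is one possible way to rerun the same command with synchronous kernel launches (`CUDA_LAUNCH_BLOCKING=1`) and NCCL logging (`NCCL_DEBUG=INFO`). The helper names `debug_env`, `train_command`, and `rerun` are hypothetical and not part of GET3D; the flags are copied from the command in this issue. It is a diagnostic aid under those assumptions, not a fix.

```python
import os
import subprocess
import sys


def debug_env(base=None):
    """Return a copy of the environment with CUDA/NCCL debugging switches set."""
    env = dict(os.environ if base is None else base)
    # Make kernel launches synchronous so the Python traceback points at
    # the call that actually failed.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Ask NCCL to log its setup and any errors it hits.
    env["NCCL_DEBUG"] = "INFO"
    return env


def train_command(gpus, batch):
    """Build the training command from this issue for a given GPU count and batch size."""
    return [
        sys.executable, "train_3d.py",
        "--outdir=./outdir",
        "--data=shapenet_get3d/img/03790512",
        "--camera_path", "shapenet_get3d/camera",
        f"--gpus={gpus}",
        f"--batch={batch}",
        "--gamma=40",
        "--data_camera_mode", "shapenet_motorbike",
        "--dmtet_scale", "1.0",
        "--use_shapenet_split", "1",
        "--one_3d_generator", "0",
        "--img_res=256",
        "--kimg=200",
        "--workers", "1",
    ]


def rerun(gpus=8, batch=32):
    """Run training once with the debug environment and return the exit code."""
    result = subprocess.run(train_command(gpus, batch), env=debug_env(), check=False)
    return result.returncode
```

Calling `rerun()` from the GET3D checkout repeats the failing 8-GPU run with the extra diagnostics. Calling `rerun(gpus=1, batch=4)` (the batch size here is an arbitrary choice) would show whether the error also occurs on a single GPU.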