liuzhengzhe / 3D-to-2D-Distillation-for-Indoor-Scene-Parsing

CVPR 2021 Oral https://arxiv.org/abs/2104.02243

RuntimeError: cuda runtime error (59) : device-side assert triggered #3

Closed · mtli77 closed this issue 3 years ago

mtli77 commented 3 years ago

Hi @liuzhengzhe,

Training fails with the following error on PyTorch 1.2.0 with CUDA 10.0:

Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
Totally 22267 samples in train set.
Starting Checking image&label pair train list...
Checking image&label pair train list done!
epoch_worker 0
111
Warning:  using Python fallback for SyncBatchNorm, possibly because apex was installed without --cuda_ext.  The exception raised when attempting to import the cuda backend was:  No module named 'syncbn'
Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
Totally 22267 samples in train set.
Starting Checking image&label pair train list...
Checking image&label pair train list done!
epoch_worker 0
111
/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/torch/nn/_reduction.py:43: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THC/THCTensorMathReduce.cuh line=420 error=59 : device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /opt/conda/conda-bld/pytorch_1565272279342/work/torch/lib/c10d/../c10d/NCCLUtils.hpp:29, unhandled cuda error
Traceback (most recent call last):
  File "tool/train.py", line 487, in <module>
    main()
  File "tool/train.py", line 105, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/disk/tia/3D-to-2D-Distillation-for-Indoor-Scene-Parsing/3d-2d-distillation/tool/train.py", line 243, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)   # ERROR
  File "/disk/tia/3D-to-2D-Distillation-for-Indoor-Scene-Parsing/3d-2d-distillation/tool/train.py", line 324, in train
    output, main_loss, aux_loss, reg_loss, final_loss = model(input, target, feat, featidx)    # TODO ERROR
  File "/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/parallel/distributed.py", line 560, in forward
    result = self.module(*inputs, **kwargs)
  File "/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/disk/tia/3D-to-2D-Distillation-for-Indoor-Scene-Parsing/3d-2d-distillation/model/pspnet.py", line 235, in forward
    return x2.max(1)[1], reg_loss*0, reg_loss*0 , reg_loss.cuda(), main_loss2
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THC/THCTensorMathReduce.cuh:420

/home/ubuntu/.conda/envs/prnet/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))

How can I fix this? Thank you for sharing the code!
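
For what it's worth, device-side asserts are reported asynchronously, so the line flagged in pspnet.py (the `x2.max(1)` call) may not be where the assert actually fired; running once with `CUDA_LAUNCH_BLOCKING=1` in the environment makes the traceback point at the real failing kernel. A common trigger for error 59 in segmentation training is target labels outside the range the loss expects. A minimal sanity check that could be run on a batch right before the loss, using `num_classes=20` and `ignore_label=255` only as placeholder values (the real ones come from the training config), would look like this:

```python
import torch

def check_targets(target: torch.Tensor,
                  num_classes: int = 20,
                  ignore_label: int = 255) -> None:
    """Raise if any label (other than ignore_label) falls outside [0, num_classes - 1].

    num_classes and ignore_label are placeholders; substitute the values
    from the training config.
    """
    valid = target[target != ignore_label]
    if valid.numel() > 0 and (valid.min() < 0 or valid.max() >= num_classes):
        raise ValueError(
            f"labels out of range [0, {num_classes - 1}]: "
            f"min={valid.min().item()}, max={valid.max().item()}"
        )
```

If this raises, the dataset labels (or the ignore index) need remapping before they reach the loss; if it passes, the assert is probably coming from somewhere else and `CUDA_LAUNCH_BLOCKING=1` should reveal where.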