Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
Totally 22267 samples in train set.
Starting Checking image&label pair train list...
Checking image&label pair train list done!
epoch_worker 0
111
Warning: using Python fallback for SyncBatchNorm, possibly because apex was installed without --cuda_ext. The exception raised when attempting to import the cuda backend was: No module named 'syncbn'
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
Totally 22267 samples in train set.
Starting Checking image&label pair train list...
Checking image&label pair train list done!
epoch_worker 0
111
/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/torch/nn/_reduction.py:43: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THC/THCTensorMathReduce.cuh line=420 error=59 : device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: /opt/conda/conda-bld/pytorch_1565272279342/work/torch/lib/c10d/../c10d/NCCLUtils.hpp:29, unhandled cuda error
Traceback (most recent call last):
  File "tool/train.py", line 487, in <module>
    main()
  File "tool/train.py", line 105, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/disk/tia/3D-to-2D-Distillation-for-Indoor-Scene-Parsing/3d-2d-distillation/tool/train.py", line 243, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch) # ERROR
  File "/disk/tia/3D-to-2D-Distillation-for-Indoor-Scene-Parsing/3d-2d-distillation/tool/train.py", line 324, in train
    output, main_loss, aux_loss, reg_loss, final_loss = model(input, target, feat, featidx) # TODO ERROR
  File "/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/parallel/distributed.py", line 560, in forward
    result = self.module(*inputs, **kwargs)
  File "/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.conda/envs/prnet/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/disk/tia/3D-to-2D-Distillation-for-Indoor-Scene-Parsing/3d-2d-distillation/model/pspnet.py", line 235, in forward
    return x2.max(1)[1], reg_loss*0, reg_loss*0 , reg_loss.cuda(), main_loss2
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THC/THCTensorMathReduce.cuh:420
/home/ubuntu/.conda/envs/prnet/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown
  len(cache))
Hi @liuzhengzhe,
Training fails with the error above on PyTorch 1.2.0 and CUDA 10.0. How can I fix it?
Thank you for sharing the code!
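In case it helps to narrow this down: CUDA device-side asserts are reported asynchronously, so the line flagged in pspnet.py (the x2.max(1) call) may not be where the failure actually originates; in my experience this kind of assert is often raised earlier by the criterion when a target label falls outside [0, num_classes - 1]. Below is a minimal sketch of the check I ran on my side; the class count, ignore index, and label directory are guesses for my local setup, not values taken from this repository.

```python
# Standalone sanity check that every ground-truth label is either a valid
# class index or the ignore index. All constants and paths below are
# assumptions for my local setup, not values from this repository.
import glob

import numpy as np
from PIL import Image

NUM_CLASSES = 20      # guess: number of classes the criterion expects
IGNORE_LABEL = 255    # guess: ignore_index passed to the loss

for path in sorted(glob.glob('data/label/train/*.png')):  # guess: label dir
    label = np.array(Image.open(path))
    bad = np.unique(label[(label >= NUM_CLASSES) & (label != IGNORE_LABEL)])
    if bad.size > 0:
        print(f'{path}: unexpected label values {bad.tolist()}')
```

Re-running the failing command with CUDA_LAUNCH_BLOCKING=1 should also make the traceback point at the kernel that actually triggers the assert instead of the later x2.max(1) call.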