CUDA error when training the network on multiple GPUs

When I train the network on multiple GPUs, I get a CUDA error in the Discriminator. I'm using three 1080 Ti GPUs with CUDA 9.0. The error message is below:

Something went wrong: CUDA error: device-side assert triggered (createEvent at /opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/ATen/cuda/CUDAEvent.h:173)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f7a67462dc5 in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x1104e5f (0x7f7a6b9b6e5f in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: void (anonymous namespace)::copy_device_to_device<float, float>(at::Tensor&, at::Tensor const&) + 0x1ad (0x7f7a6ccce55d in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: void (anonymous namespace)::_copy__cuda(at::Tensor&, at::Tensor const&, bool) + 0xa0b (0x7f7a6cd2ba1b in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #4: at::native::_s_copy__cuda(at::Tensor&, at::Tensor const&, bool) + 0x198 (0x7f7a6cc6a6e8 in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #5: at::CUDAType::scopy(at::Tensor&, at::Tensor const&, bool) const + 0xcf (0x7f7a6ba859bf in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #6: at::native::copy(at::Tensor&, at::Tensor const&, bool) + 0x26d (0x7f7a67cef2ad in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #7: torch::autograd::VariableType::copy(at::Tensor&, at::Tensor const&, bool) const + 0x64f (0x7f7a60323cff in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #8: torch::cuda::gather(c10::ArrayRef, long, c10::optional) + 0x40f (0x7f7a606538ff in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #9: + 0x5a706e (0x7f7a8d6aa06e in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x12ce4a (0x7f7a8d22fe4a in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #21: THPFunction_apply(_object*, _object*) + 0x691 (0x7f7a8d4b2081 in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

Traceback (most recent call last):
  File "run.py", line 345, in main()
  File "run.py", line 341, in main
    raise e
  File "run.py", line 325, in main
    continue_id=config.continue_id,
  File "run.py", line 120, in meta_train
    r_x_hat, D_act_hat = D(x_hat, y_t, i)
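
A possible way to narrow this down, offered as a sketch rather than a confirmed diagnosis: CUDA kernels run asynchronously, so a "device-side assert triggered" message is often reported at a later, unrelated call (here the gather step) instead of at the operation that actually failed. Setting the `CUDA_LAUNCH_BLOCKING=1` environment variable before CUDA is initialised makes kernel launches synchronous, so the Python traceback points closer to the real failure. One common cause of this assert is an index that falls outside the range of a lookup table. Whether that applies to the `i` passed to `D(x_hat, y_t, i)` is only an assumption; the trace above does not confirm it. The helper `find_out_of_range` and its argument names are hypothetical and not part of the repository.

```python
import os

# Make CUDA kernel launches synchronous so errors surface at the failing call.
# This must run before the first CUDA call in the process (ideally before
# importing the training code), otherwise it has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def find_out_of_range(indices, table_size):
    """Return every index that is not in the valid range [0, table_size).

    `indices` is any iterable of integers (hypothetically, the video indices
    handed to the discriminator); `table_size` is the number of rows in the
    lookup table they are used to index.
    """
    return [int(i) for i in indices if not 0 <= int(i) < table_size]


# Example check on the CPU, before the batch is sent to the GPUs.
bad = find_out_of_range([0, 2, 5, -1], table_size=5)
if bad:
    print("indices outside the table:", bad)
```

Running the same batch on the CPU instead of the GPU is another option, since an out-of-range index then raises an ordinary Python exception that names the offending operation.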

grey-eye / talking-heads

CUDA error when train network using multi-GPU #23