grey-eye / talking-heads

Our implementation of "Few-Shot Adversarial Learning of Realistic Neural Talking Head Models" (Egor Zakharov et al.)
GNU General Public License v3.0
593 stars, 110 forks

CUDA error when train network using multi-GPU #23

Closed: gwangyoungyoum closed this issue 5 years ago

gwangyoungyoum commented 5 years ago

When I train the network on multiple GPUs, a CUDA error is raised in the Discriminator. I'm using three 1080 Ti GPUs with CUDA 9.0. The error message is below:

Something went wrong: CUDA error: device-side assert triggered (createEvent at /opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/ATen/cuda/CUDAEvent.h:173)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f7a67462dc5 in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1104e5f (0x7f7a6b9b6e5f in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: void (anonymous namespace)::copy_device_to_device<float, float>(at::Tensor&, at::Tensor const&) + 0x1ad (0x7f7a6ccce55d in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: void (anonymous namespace)::_copy__cuda(at::Tensor&, at::Tensor const&, bool) + 0xa0b (0x7f7a6cd2ba1b in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #4: at::native::_s_copy__cuda(at::Tensor&, at::Tensor const&, bool) + 0x198 (0x7f7a6cc6a6e8 in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #5: at::CUDAType::scopy(at::Tensor&, at::Tensor const&, bool) const + 0xcf (0x7f7a6ba859bf in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #6: at::native::copy(at::Tensor&, at::Tensor const&, bool) + 0x26d (0x7f7a67cef2ad in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #7: torch::autograd::VariableType::copy(at::Tensor&, at::Tensor const&, bool) const + 0x64f (0x7f7a60323cff in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #8: torch::cuda::gather(c10::ArrayRef, long, c10::optional) + 0x40f (0x7f7a606538ff in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #9: <unknown function> + 0x5a706e (0x7f7a8d6aa06e in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x12ce4a (0x7f7a8d22fe4a in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #21: THPFunction_apply(_object*, _object*) + 0x691 (0x7f7a8d4b2081 in /home/yky5464/.conda/envs/gwangyoung_pytorch_few_shot_face/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Traceback (most recent call last):
  File "run.py", line 345, in <module>
    main()
  File "run.py", line 341, in main
    raise e
  File "run.py", line 325, in main
    continue_id=config.continue_id,
  File "run.py", line 120, in meta_train
    r_x_hat, D_act_hat = D(x_hat, y_t, i)

castelo-software commented 5 years ago

I think you're running into this problem: https://github.com/grey-eye/talking-heads/issues/22 I will probably commit my code soon!