Open pursuingz opened 9 months ago
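I have run training three times, and every run failed at the 20th iteration. Could you take a look? The first two runs failed with the following error: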
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4bfb40d4d7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4bfb3d736b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4b946cdb58 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1985457 (0x7f4b9696d457 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x1d4b680 (0x7f4be3baa680 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x62 (0x7f4be3bab812 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x15f (0x7f4be481a7bf in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1b6b (0x7f4be3e9e2ab in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x2d2206b (0x7f4be4b8106b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x7f4be4368455 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2b5b453 (0x7f4be49ba453 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x7f4be4368455 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4015f9b (0x7f4be5e74f9b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x401641e (0x7f4be5e7541e in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1f9 (0x7f4be43ee819 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x11b (0x7f4be3e94e5b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x2eeef81 (0x7f4be4d4df81 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x20e (0x7f4be456d15e in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::Tensor::to(c10::TensorOptions, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x132 (0x7f4bfb869d22 in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #19: NeuralNetwork::infer() + 0xb6b (0x7f4bfb86777b in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #20: <unknown function> + 0x5972d (0x7f4bfb86872d in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #21: <unknown function> + 0x145a0 (0x7f4bfba115a0 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #22: <unknown function> + 0x8609 (0x7f4c1b7ff609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #23: clone + 0x43 (0x7f4c1b724133 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
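For the third run, I followed the suggestion in the error message and set CUDA_LAUNCH_BLOCKING=1 before launching. That run ended with the following error: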
terminate called after throwing an instance of 'std::runtime_error'
  what(): The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/neural_network/___torch_mangle_1624.py", line 30, in forward
    p_conv = self.p_conv
    res_layers = self.res_layers
    _0 = (res_layers).forward(inputs, )
         ~~~~~~~~~~~~~~~~~~~ <--- HERE
    _1 = (p_bn).forward((p_conv).forward(_0, ), )
    _2 = (relu).forward(_1, )
  File "code/__torch__/torch/nn/modules/container/___torch_mangle_1613.py", line 16, in forward
    _1 = getattr(self, "1")
    _0 = getattr(self, "0")
    _4 = (_1).forward((_0).forward(inputs, ), )
         ~~~~~~~~~~~ <--- HERE
    return (_3).forward((_2).forward(_4, ), )
  File "code/__torch__/neural_network/___torch_mangle_1594.py", line 25, in forward
    _1 = (conv2).forward((relu).forward(_0, ), )
    _2 = (bn2).forward(_1, )
    _3 = (downsample_bn).forward((downsample_conv).forward(inputs, ), )
         ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    input = torch.add_(_2, _3)
    return (relu).forward1(input, )
  File "code/__torch__/torch/nn/modules/conv/___torch_mangle_1592.py", line 10, in forward
    inputs: Tensor) -> Tensor:
    weight = self.weight
    input = torch._convolution(inputs, weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1, False, False, True, True)
            ~~~~~~~~~~~~~~~~~~ <--- HERE
    return input

Traceback of TorchScript, original code (most recent call last):
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py(459): _conv_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py(463): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(47): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/container.py(217): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(84): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/miniconda3/lib/python3.8/site-packages/torch/jit/_trace.py(1056): trace_module
/root/miniconda3/lib/python3.8/site-packages/torch/jit/_trace.py(794): trace
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(279): save_model
/root/autodl-tmp/alpha-zero-gomoku/test/../src/learner.py(114): learn
learner_test.py(17): <module>
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Aborted (core dumped)
Using CUDA 11.6 / PyTorch 1.10 / LibTorch 1.10 (Pre-cxx11 ABI) / SWIG 4.0.2
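
For anyone trying to reproduce this: the CUDA_LAUNCH_BLOCKING=1 setting used for the third run can also be applied from inside the Python entry script rather than from the shell. The snippet below is only a rough sketch. It assumes the entry script is learner_test.py (the name that appears in the traceback) and that these lines are placed at the very top of it, before torch or the compiled library is imported, so that nothing has initialised CUDA yet.

```python
import os

# Make CUDA kernel launches synchronous so the reported stack trace points at
# the call that actually failed. Assumption: this runs before any CUDA work,
# i.e. before `import torch` and before the SWIG-built library is loaded.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Setting the variable this way only affects the current process and any child processes it starts. It slows training down, so it is probably best kept for debugging runs only.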