hijkzzz / alpha-zero-gomoku

A Multi-threaded Implementation of AlphaZero

Training fails at the 20th iteration #37

Open pursuingz opened 8 months ago

pursuingz commented 8 months ago

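I have run training three times, and every run failed at the 20th iteration. Could you take a look? The first two runs failed with the following error: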

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4bfb40d4d7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4bfb3d736b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4b946cdb58 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1985457 (0x7f4b9696d457 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x1d4b680 (0x7f4be3baa680 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x62 (0x7f4be3bab812 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x15f (0x7f4be481a7bf in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1b6b (0x7f4be3e9e2ab in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x2d2206b (0x7f4be4b8106b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x7f4be4368455 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2b5b453 (0x7f4be49ba453 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x7f4be4368455 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4015f9b (0x7f4be5e74f9b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x401641e (0x7f4be5e7541e in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1f9 (0x7f4be43ee819 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x11b (0x7f4be3e94e5b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x2eeef81 (0x7f4be4d4df81 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x20e (0x7f4be456d15e in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::Tensor::to(c10::TensorOptions, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x132 (0x7f4bfb869d22 in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #19: NeuralNetwork::infer() + 0xb6b (0x7f4bfb86777b in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #20: <unknown function> + 0x5972d (0x7f4bfb86872d in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #21: <unknown function> + 0x145a0 (0x7f4bfba115a0 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #22: <unknown function> + 0x8609 (0x7f4c1b7ff609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #23: clone + 0x43 (0x7f4c1b724133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

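For the last run, I followed the suggestion in the error message and set CUDA_LAUNCH_BLOCKING=1 before starting. That run ended with the following error: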

terminate called after throwing an instance of 'std::runtime_error'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/neural_network/___torch_mangle_1624.py", line 30, in forward
    p_conv = self.p_conv
    res_layers = self.res_layers
    _0 = (res_layers).forward(inputs, )
          ~~~~~~~~~~~~~~~~~~~ <--- HERE
    _1 = (p_bn).forward((p_conv).forward(_0, ), )
    _2 = (relu).forward(_1, )
  File "code/__torch__/torch/nn/modules/container/___torch_mangle_1613.py", line 16, in forward
    _1 = getattr(self, "1")
    _0 = getattr(self, "0")
    _4 = (_1).forward((_0).forward(inputs, ), )
                       ~~~~~~~~~~~ <--- HERE
    return (_3).forward((_2).forward(_4, ), )
  File "code/__torch__/neural_network/___torch_mangle_1594.py", line 25, in forward
    _1 = (conv2).forward((relu).forward(_0, ), )
    _2 = (bn2).forward(_1, )
    _3 = (downsample_bn).forward((downsample_conv).forward(inputs, ), )
                                  ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    input = torch.add_(_2, _3)
    return (relu).forward1(input, )
  File "code/__torch__/torch/nn/modules/conv/___torch_mangle_1592.py", line 10, in forward
    inputs: Tensor) -> Tensor:
    weight = self.weight
    input = torch._convolution(inputs, weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1, False, False, True, True)
            ~~~~~~~~~~~~~~~~~~ <--- HERE
    return input

Traceback of TorchScript, original code (most recent call last):
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py(459): _conv_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py(463): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(47): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/container.py(217): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(84): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/miniconda3/lib/python3.8/site-packages/torch/jit/_trace.py(1056): trace_module
/root/miniconda3/lib/python3.8/site-packages/torch/jit/_trace.py(794): trace
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(279): save_model
/root/autodl-tmp/alpha-zero-gomoku/test/../src/learner.py(114): learn
learner_test.py(17): <module>
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Aborted (core dumped)

hijkzzz commented 7 months ago

Use CUDA 11.6, PyTorch 1.10, LibTorch 1.10 (Pre-cxx11 ABI), and SWIG 4.0.2.
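
To compare an existing environment against these versions, something like the following could be run first. It is only a minimal sketch: the helper names (`matches`, `report`, `installed_versions`) are made up for illustration and are not part of this repository, and it checks only the Python-side PyTorch package and the CUDA version PyTorch was built with. The LibTorch build and SWIG have to be checked separately.

```python
import importlib.util

# Versions recommended in the reply above.
EXPECTED = {"torch": "1.10", "cuda": "11.6"}


def matches(actual, expected):
    """Return True when `actual` equals `expected` or only adds a patch/build suffix."""
    if actual is None:
        return False
    # Drop local build tags such as "+cu113" before comparing.
    base = actual.split("+")[0]
    return base == expected or base.startswith(expected + ".")


def report(installed, expected=EXPECTED):
    """Map each component to (installed version, whether it matches the expected one)."""
    return {
        name: (installed.get(name), matches(installed.get(name), want))
        for name, want in expected.items()
    }


def installed_versions():
    """Collect installed versions; returns an empty dict if PyTorch is not importable."""
    if importlib.util.find_spec("torch") is None:
        return {}
    import torch

    # torch.version.cuda is None for CPU-only builds.
    return {"torch": torch.__version__, "cuda": torch.version.cuda}


if __name__ == "__main__":
    for name, (found, ok) in report(installed_versions()).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: found {found}, expected {EXPECTED[name]} -> {status}")
```

A `MISMATCH` line only means the environment differs from the combination recommended here; it does not by itself prove that the version difference causes the crash.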