leela-zero / leela-zero

Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper.
GNU General Public License v3.0

Segfault during clEnqueueWriteBuffer #2438

Open inclement opened 5 years ago

inclement commented 5 years ago

I've been trying to get LZ working with a Radeon RX 580, but it segfaults during clEnqueueWriteBuffer.

Quick gdb output:

Detected 1 OpenCL platforms.
Platform version: OpenCL 1.1 Mesa 19.1.1
Platform profile: FULL_PROFILE
Platform name:    Clover
Platform vendor:  Mesa
Device ID:     0
Device name:   Radeon RX 580 Series (POLARIS10, DRM 3.30.0, 5.1.15-arch1-1-ARCH, LLVM 8.0.0)
Device type:   GPU
Device vendor: AMD
Device driver: 19.1.1
Device speed:  1366 MHz
Device cores:  36 CU
Device score:  1111
Selected platform: Clover
Selected device: Radeon RX 580 Series (POLARIS10, DRM 3.30.0, 5.1.15-arch1-1-ARCH, LLVM 8.0.0)
with OpenCL 1.1 capability.
Half precision compute support: Yes.
Tensor Core support: No.
OpenCL: using fp16/half or tensor core compute support.

Started OpenCL SGEMM tuner.
Will try 290 valid configurations.

Thread 1 "tests" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) backtrace
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff6d616a2 in ?? () from /usr/lib/libMesaOpenCL.so.1
#2  0x00007ffff6d52e3f in ?? () from /usr/lib/libMesaOpenCL.so.1
#3  0x00007ffff6d53a13 in ?? () from /usr/lib/libMesaOpenCL.so.1
#4  0x00007ffff6d54291 in ?? () from /usr/lib/libMesaOpenCL.so.1
#5  0x00007ffff6d511a5 in ?? () from /usr/lib/libMesaOpenCL.so.1
#6  0x00007ffff6d4ecde in ?? () from /usr/lib/libMesaOpenCL.so.1
#7  0x00007ffff7ecaa4e in clEnqueueWriteBuffer () from /usr/lib/libOpenCL.so.1
#8  0x00005555556212e8 in cl::CommandQueue::enqueueWriteBuffer (blocking=0, offset=0, events=0x0, event=0x0, 
    ptr=<optimized out>, size=147456, buffer=..., this=<synthetic pointer>)
    at /home/sandy/devel/leela-zero/src/CL/cl2.hpp:7166
#9  Tuner<half_float::half>::tune_sgemm[abi:cxx11](int, int, int, int, int) (this=0x7fffffffdb00, m=8, n=25, k=8, 
    batch_size=36, runs=<optimized out>) at /home/sandy/devel/leela-zero/src/Tuner.cpp:491
#10 0x00005555556227ad in Tuner<half_float::half>::load_sgemm_tuners[abi:cxx11](int, int, int, int) (
    this=0x7fffffffdb00, m=8, n=25, k=8, batch_size=36) at /usr/include/c++/9.1.0/ext/new_allocator.h:89
#11 0x00005555556358a6 in OpenCL<half_float::half>::initialize (this=0x555555769a00, channels=8, batch_size=1)
    at /home/sandy/devel/leela-zero/src/Tuner.cpp:722
#12 0x0000555555635f05 in OpenCLScheduler<half_float::half>::initialize (this=0x555555769920, channels=8)
    at /usr/include/c++/9.1.0/bits/unique_ptr.h:357
#13 0x000055555564b76b in Network::init_net (this=0x7ffff6e04010, channels=8, pipe=...)
    at /usr/include/c++/9.1.0/bits/unique_ptr.h:357
#14 0x00005555556536d2 in Network::select_precision (this=0x7ffff6e04010, channels=8)
    at /usr/include/c++/9.1.0/bits/move.h:74
#15 0x0000555555653ee2 in Network::initialize (this=0x7ffff6e04010, playouts=<optimized out>, weightsfile=...)
    at /home/sandy/devel/leela-zero/src/Network.cpp:573
#16 0x0000555555684979 in LeelaEnv::SetUp (this=<optimized out>) at /home/sandy/devel/leela-zero/src/tests/gtests.cpp:87
#17 0x000055555569af9c in testing::internal::SetUpEnvironment(testing::Environment*) ()
--Type <RET> for more, q to quit, c to continue without paging--
::__normal_iterator<testing::Environment* const*, std::vector<testing::Environment*, std::allocator<testing::Environment*> > >, void (*)(testing::Environment*)))(testing::Environment*) ()
#19 0x00005555556b1931 in void testing::internal::ForEach<std::vector<testing::Environment*, std::allocator<testing::Environment*> >, void (*)(testing::Environment*)>(std::vector<testing::Environment*, std::allocator<testing::Environment*> > const&, void (*)(testing::Environment*)) ()
#20 0x000055555569b207 in testing::internal::UnitTestImpl::RunAllTests() ()
#21 0x00005555556b78c4 in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#22 0x00005555556b11e5 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#23 0x0000555555699d2a in testing::UnitTest::Run() ()
#24 0x000055555568824b in RUN_ALL_TESTS() ()
#25 0x00005555556881d9 in main ()

From a quick look I can't see anything obviously wrong, but I'm not very familiar with debugging C++.

I haven't tested the GPU setup much, so I wondered whether this could be an issue with my OpenCL environment, but KataGo does run fine. Any ideas whether this is an LZ issue or must be something else?

iopq commented 5 years ago

KataGo OpenCL branch is working for you? Lol

Try the ROCm drivers.