glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0
760 stars 299 forks

Core dump when running leela-chess on Google Colab #284

Open djinnome opened 6 years ago

djinnome commented 6 years ago

Hi folks,

When I downloaded the 118 weights and ran the following in a Colab cell:

%%bash
apt install libboost-all-dev libopenblas-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev zlib1g-dev

apt install clinfo && clinfo
apt install cmake
git clone https://github.com/glinscott/leela-chess.git

cd leela-chess
git submodule update --init --recursive
mkdir -p build && cd build

cmake ..
make
./tests

wget -O weights.txt.gz http://lczero.org/get_network?sha=1d1b1a4d9d708ef04d7714b604bddea29122ec2027369e111197f7b9537b1bf8  
gunzip weights.txt.gz
cp ../scripts/train.sh .
./train.sh

I got the following error message:

Using 1 thread(s).
Generated 1924 moves
Detecting residual layers...v1...64 channels...Using 1 thread(s).
Generated 1924 moves
Detecting residual layers...v1...64 channels...6 blocks.
6 blocks.
Initializing OpenCL.
OpenCL: clGetPlatformIDs
terminate called after throwing an instance of 'cl::Error'
  what():  clGetPlatformIDs
./train.sh: line 13:  2659 Aborted                 (core dumped) ./lczero --weights=weights.txt --randomize -n -t1 --start="train 1" > training.out
Initializing OpenCL.
OpenCL: clGetPlatformIDs
terminate called after throwing an instance of 'cl::Error'
  what():  clGetPlatformIDs
./train.sh: line 13:  2660 Aborted                 (core dumped) ./lczero --weights=weights.txt --randomize -n -t1 --start="train 2" > training2.out
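
For context, the abort happens inside clGetPlatformIDs, which generally means no OpenCL platform (ICD) is visible to the process. A minimal sanity check in a Colab cell, assuming clinfo is installed as in the cell above:

%%bash
# Zero platforms here means lczero's OpenCL backend will abort in clGetPlatformIDs.
clinfo | grep -i "number of platforms"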
Akababa commented 6 years ago

Have you tried the tensorflow version? (lc0 folder)

djinnome commented 6 years ago

The advantage of using Google Colab is that everyone with a Google account gets free access to 4 GPUs, so if this is actually an issue with leela-chess (and not just my mistake), it is worth correcting, because you would instantly gain 4 more GPUs for every person who runs leela-chess.

mkdir -p run && cd run
cp ~/leela-chess/build/lczero .
wget -O client_linux https://github.com/glinscott/leela-chess/releases/download/v0.4/client_linux 
chmod +x client_linux && ./client_linux --user djinnome --password XXXX --gpu 1

Running the above results in a problem with OpenCL when attempting to get the platform ID:

Args: [/content/run/lczero --weights=networks/94c816e13232334d6b69353c23ee3185afbc3dd3ab104125131bb93aa1c26e8f -t1 --randomize -n -v1600 -l/content/run/logs-2619/20180411034925.log --start=train 2619-0 1 --gpu=0]
Logging to /content/run/logs-2619/20180411034925.log.
Using 1 thread(s).
Generated 1924 moves
Detecting residual layers...v1...64 channels...6 blocks.
Initializing OpenCL.
OpenCL: clGetPlatformIDs
terminate called after throwing an instance of 'cl::Error'
  what():  clGetPlatformIDs
2018/04/11 03:49:27 signal: aborted (core dumped)

Just to prove that I really do have a GPU:

import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

results in:

Found GPU at: /device:GPU:0
djinnome commented 6 years ago

Hey @Akababa, thanks for the suggestion.

It appears that leela-chess/lc0/build.sh needs a few extra packages:

apt install meson ninja-build clang

Now when I try to run ./build.sh, I get the following error:

The Meson build system
Version: 0.42.1
Source dir: /content/leela-chess/lc0
Build dir: /content/leela-chess/lc0/build
Build type: native build
Project name: lc0

Meson encountered an error in file meson.build, line 1, column 0:
Value "c++17" for combo option "cpp_std" is not one of the choices. Possible choices are: "none", "c++03", "c++11", "c++14", "c++1z", "gnu++11", "gnu++14", "gnu++1z".
ninja: error: loading 'build.ninja': No such file or directory

Any suggestions?

This is the entirety of build.sh:

#!/usr/bin/bash

rm -fr build
CC=clang CXX=clang++ meson build --buildtype release
# CC=clang CXX=clang++ meson build --buildtype debugoptimized
cd build
ninja
djinnome commented 6 years ago

Looking at

leela-chess/lc0/meson.build

I see that I need to install TensorFlow from source, because my /usr/local does not contain the files that are expected below. I am also wondering what I need to upgrade or install so that c++17 is an acceptable value for cpp_std, as per the error above.

project('lc0', 'cpp', 
        default_options : ['c_std=c17', 'cpp_std=c++17'])

# add_global_arguments('-Wno-macro-redefined', language : 'cpp')
cc = meson.get_compiler('cpp')

# Installed from https://github.com/FloopCZ/tensorflow_cc
tensorflow_cc = declare_dependency(
  include_directories: include_directories(
    '/usr/local/include/tensorflow',
    '/usr/local/include/tensorflow/bazel-genfiles',
    '/usr/local/include/tensorflow/tensorflow/contrib/makefile/downloads',
    '/usr/local/include/tensorflow/tensorflow/contrib/makefile/downloads/eigen',
    '/usr/local/include/tensorflow/tensorflow/contrib/makefile/downloads/gemmlowp',
    '/usr/local/include/tensorflow/tensorflow/contrib/makefile/downloads/nsync/public',
    '/usr/local/include/tensorflow/tensorflow/contrib/makefile/gen/protobuf-host/include',
  ),
  dependencies: [
      cc.find_library('libtensorflow_cc', dirs: '/usr/local/lib/tensorflow_cc/'),
      cc.find_library('dl'),
      cc.find_library('pthread'),
      cc.find_library('libprotobuf', dirs: '/usr/local/lib/tensorflow_cc/'),
  ],
)

deps = []
deps += tensorflow_cc
deps += cc.find_library('stdc++fs')
# deps += dependency('libprofiler')
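
Since the meson.build above expects tensorflow_cc headers and libraries under /usr/local, one rough way to provide them is a from-source build of https://github.com/FloopCZ/tensorflow_cc along these lines. This is only a sketch based on that project's README at the time; check it for the exact CMake options (shared vs. static build) and install targets:

%%bash
git clone https://github.com/FloopCZ/tensorflow_cc.git
cd tensorflow_cc/tensorflow_cc
mkdir build && cd build
cmake ..       # see the tensorflow_cc README for shared-library options
make           # this builds TensorFlow itself, so it takes a long time
make install   # installs under /usr/local (use sudo outside Colab)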
mooskagh commented 6 years ago

The Meson installed here (0.42.1) seems not to know about c++17 yet. It can be worked around with:

project('lc0', 'cpp')

add_global_arguments('-std=c++17', language : 'cpp')

I'll change that in the config.
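
A possible alternative to patching meson.build, assuming pip3 is available on the Colab image: newer Meson releases accept c++17 as a cpp_std value, so upgrading Meson should also get past the error above.

%%bash
pip3 install --upgrade meson   # newer releases list c++17 among the cpp_std choices
meson --version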

mooskagh commented 6 years ago

FYI, if you want to build the TensorFlow version of lc0, my fork at https://github.com/mooskagh/leela-chess has it with some fixes. I still suspect there is something wrong with it (it often blunders in won positions), but you may want to try it.

glinscott commented 6 years ago

It looks like the OpenCL drivers are not working. What does clinfo give?

djinnome commented 6 years ago

Hi @glinscott, you are right. clinfo returns the following:

Number of platforms                               0

However, tensorflow can find the GPU:

import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

results in:

Found GPU at: /device:GPU:0
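
This is consistent with the CUDA driver being present (TensorFlow talks to it directly) while no OpenCL ICD is registered for the loader to find. A rough check, assuming the standard ocl-icd layout:

%%bash
# The ICD loader discovers platforms via vendor files in /etc/OpenCL/vendors;
# if none are present, clGetPlatformIDs reports 0 platforms even though the GPU works.
ls /etc/OpenCL/vendors/ 2>/dev/null || echo "no OpenCL vendor ICDs registered"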
djinnome commented 6 years ago

OK,

I ran apt install nvidia-cuda-toolkit and now clinfo returns the following:

Number of platforms                               1
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 1.2 CUDA 9.0.282
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
  Platform Extensions function suffix             NV

  Platform Name                                   NVIDIA CUDA
Number of devices                                 1
  Device Name                                     Tesla K80
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  384.111
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Topology (NV)                            PCI-E, 00:00.4
  Max compute units                               13
  Max clock frequency                             823MHz
  Compute Capability (NV)                         3.7
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Address bits                                    64, Little-Endian
  Global memory size                              11995578368 (11.17GiB)
  Error Correction support                        Yes
  Max memory allocation                           2998894592 (2.793GiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        212992
  Global Memory cache line                        128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             4096x4096x4096 pixels
    Max number of read image args                 256
    Max number of write image args                16
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max constant buffer size                        65536 (64KiB)
  Max number of constant args                     9
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 No
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  NVIDIA CUDA
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [NV]
  clCreateContext(NULL, ...) [default]            Success [NV]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.11
  ICD loader Profile                              OpenCL 2.1
djinnome commented 6 years ago

The full successful workflow is as follows:


apt install cmake nvidia-cuda-toolkit git-all libboost-all-dev libopenblas-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev zlib1g-dev
apt install clinfo && clinfo

followed by:

git clone https://github.com/glinscott/leela-chess.git
cd leela-chess
git submodule update --init --recursive
mkdir -p build && cd build
cmake ..
make

followed by:

cp leela-chess/build/lczero .
wget -c https://github.com/glinscott/leela-chess/releases/download/v0.6/client_linux
chmod +x client_linux && ./client_linux --user <your username> --password XXX --debug
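
For convenience, the same workflow assembled into a single Colab cell (a sketch built from the commands above; the username and password are placeholders you must replace):

%%bash
apt install -y cmake nvidia-cuda-toolkit git-all libboost-all-dev libopenblas-dev \
    opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev zlib1g-dev clinfo
clinfo | grep -i "number of platforms"   # should now report at least 1
git clone https://github.com/glinscott/leela-chess.git
cd leela-chess && git submodule update --init --recursive && mkdir -p build && cd build && cmake .. && make
cd /content
cp leela-chess/build/lczero .
wget -c https://github.com/glinscott/leela-chess/releases/download/v0.6/client_linux
chmod +x client_linux
./client_linux --user <your username> --password XXX --debug   # runs until the cell is interrupted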
djinnome commented 6 years ago

Thanks @blin00 for putting up the wiki page.

You can also copy the notebook I got working, which includes saving to Google Drive and the tuning step from @blin00.

nousian commented 6 years ago

One of my Colab notebooks is throwing an error when trying to make leela-chess (the other two work fine):

!cd leela-chess && cd build && make
In file included from /content/leela-chess/src/OpenCL.h:27:0,
                 from /content/leela-chess/src/OpenCLScheduler.h:26,
                 from /content/leela-chess/src/Network.cpp:49:
/content/leela-chess/src/CL/cl2.hpp:5857:63: warning: ignoring attributes on template argument ‘cl_int {aka int}’ [-Wignored-attributes]
     typename std::enable_if<!std::is_pointer<T>::value, cl_int>::type
                                                               ^
/content/leela-chess/src/CL/cl2.hpp:6157:22: warning: ignoring attributes on template argument ‘cl_int {aka int}’ [-Wignored-attributes]
         vector<cl_int>* binaryStatus = NULL,
                      ^
/content/leela-chess/src/Network.cpp: In static member function ‘static void Network::initialize()’:
/content/leela-chess/src/Network.cpp:507:33: error: ‘openblas_get_corename’ was not declared in this scope
     myprintf("BLAS Core: %s\n", openblas_get_corename());
                                 ^~~~~~~~~~~~~~~~~~~~~
/content/leela-chess/src/Network.cpp:507:33: note: suggested alternative: ‘openblas_set_num_threads’
     myprintf("BLAS Core: %s\n", openblas_get_corename());
                                 ^~~~~~~~~~~~~~~~~~~~~
                                 openblas_set_num_threads
CMakeFiles/objs.dir/build.make:134: recipe for target 'CMakeFiles/objs.dir/src/Network.cpp.o' failed
make[2]: *** [CMakeFiles/objs.dir/src/Network.cpp.o] Error 1
CMakeFiles/Makefile2:104: recipe for target 'CMakeFiles/objs.dir/all' failed
make[1]: *** [CMakeFiles/objs.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2

Any ideas? Restarting the runtime does not help, and creating a new notebook from scratch results in the same error. Other notebooks work just fine; this one did too before a disconnect.

djinnome commented 6 years ago

Just a guess, but maybe try !make clean before rerunning make? If that doesn't help, then !rm -rf leela-chess and !git clone https://github.com/glinscott/leela-chess.git, then !mkdir -p leela-chess/build && cd leela-chess/build && cmake .., followed by !cd leela-chess/build && make.

nousian commented 6 years ago

Inserting:

!rm -rf leela-chess

after the apt-install block and before the git clone block, and

!make clean
!cd leela-chess && cd build && make clean

before the cmake block did the trick. 1000+ nps rolling smoothly.

(c)make must have been corrupted/confused somehow. Maybe the runtime did not restart cleanly or something along those lines. Thanks.
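
For reference, the fresh-clone route suggested earlier, consolidated into one place (a sketch; cell boundaries may differ in your notebook):

# Fresh clone and rebuild; each ! line runs in its own shell, so cd is chained with &&.
!rm -rf leela-chess
!git clone https://github.com/glinscott/leela-chess.git
!cd leela-chess && git submodule update --init --recursive
!mkdir -p leela-chess/build && cd leela-chess/build && cmake ..
!cd leela-chess/build && make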

lehpo commented 6 years ago

Sir,

I cannot connect to Google Colab any more, no matter how hard I try.

I get the message "failed to assign a backend" each time.

Apart from this, when I was connected, Google Colab kept disconnecting, sometimes after only 4 minutes.

The longest uninterrupted connection was around 2 hours.

Can you or someone else fix these two issues, please?

kwccoin commented 5 years ago

The Go version needed an Ubuntu 18.x script, and I am not sure whether the same applies to chess; see this issue (around the middle) about a major script change needed for it to work: https://github.com/gcp/leela-zero/issues/1923