leela-zero / leela-zero

Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper.
GNU General Public License v3.0

Scaling up # of people training #1979

Open chrispsn opened 5 years ago

chrispsn commented 5 years ago

Hi, apologies if this is not the right place to ask (came here via the Discord chat):

I'm hoping to put together something that will scale up by 10x+ the number of games generated for training this AI.

One promising angle is using the 'bottom of the barrel' of spot cloud prices. New instances can be stopped less than an hour after creation, so I need to minimise their spin-up time. Two questions:

  1. Docker images can be cached after download for the next time that machine is available. However, while Windows binaries are distributed, Linux binaries aren't: we're told to compile from source. Even the 'official' Dockerfiles contain compilation steps. Why? Is it feasible to use a Docker image based on Ubuntu that has everything pre-compiled? If it's not as efficient, maybe the number of machines training will outweigh that.
  2. Maybe progress can be saved and continued another time or on another machine. Is there a way to save down the current game state via autogtp on a signal or similar, without using a keyboard shortcut?
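To make question 2 concrete, here's the kind of wrapper I have in mind — a sketch only, assuming (unverified) that autogtp reacts to SIGINT the same way as its keyboard shortcut:

```shell
# Sketch only: assumes autogtp saves/cleans up on SIGINT the way it does
# for the interactive keyboard shortcut; I haven't verified that.
run_until_stopped() {
  "$@" &                          # start the worker, e.g. ./autogtp -g 2
  local pid=$!
  # Cloud providers send SIGTERM shortly before stopping an instance;
  # forward it to the worker as SIGINT so it has a chance to save state.
  trap 'kill -INT "$pid"' TERM
  wait "$pid"
}
```

The instance's startup command would then be something like `run_until_stopped ./autogtp -g 2`.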

Thanks in advance.

wonderingabout commented 5 years ago

hi, as roy7 said on discord, we sort of use this issue tracker like a forum, so this is the right place

before answering your questions, the general idea i want to convey is that your cloud training should be efficient overall: you dont have to chase the last 0.0001$, but rather make it hassle-free and as cost-efficient as possible globally

on google cloud, the tesla v100 with preemptibility and autogtp -g 2 is the most cost-efficient solution (most games per dollar); i did many tests to come to that conclusion. on microsoft azure, for example, it's actually the tesla P100 that is much more cost-efficient, as i said here: https://github.com/gcp/leela-zero/issues/1905#issuecomment-433710281

then to answer you :

for 1. google cloud free trial has a GPU quota of 1, so you won't be able to run more than one GPU instance simultaneously

for 2. i don't know about that, you'd have to ask @alreadydone maybe, but it's not going to save much anyway, as a Tesla V100 takes 4.5 minutes to produce one game (8-9 minutes to produce 2 games with -g 2)
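to put numbers on that, here is a quick back-of-envelope from the timings above (using 8.5 minutes as the midpoint of 8-9):

```shell
# Games per hour implied by the timings quoted above:
# 4.5 min per game singly, 8.5 min for 2 games with -g 2.
g1=$(awk 'BEGIN { printf "%.1f", 60 / 4.5 }')
g2=$(awk 'BEGIN { printf "%.1f", 2 * 60 / 8.5 }')
echo "one game at a time: $g1 games/hour"
echo "with -g 2:          $g2 games/hour"
```

so -g 2 buys roughly 6% more games per hour on the same instance; the games-per-dollar rankings follow directly from numbers like these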

don't hesitate to correct me if you have a different opinion, these are just my thoughts

to increase the number of contributing cloud users, our best shot, i think, is to spread the cloud instructions more widely on social networks

edit: also, on page 9 of the google doc, this is the entirely automated script we're using:

https://github.com/gcp/leela-zero#using-a-cloud-provider

#!/bin/bash
# The presence of the glances package is used as a marker that first-time
# setup has already completed on this machine.
PKG_OK=$(dpkg-query -W --showformat='${Status}\n' glances | grep "install ok installed")
echo "Checking for glances: $PKG_OK"
if [ -z "$PKG_OK" ]; then
  echo "No glances. Setting up glances and all other leela-zero packages."
  sudo apt-get update && sudo apt-get -y upgrade && sudo apt-get -y dist-upgrade \
    && sudo add-apt-repository -y ppa:graphics-drivers/ppa \
    && sudo apt-get update \
    && sudo apt-get -y install nvidia-driver-410 linux-headers-generic nvidia-opencl-dev \
    && sudo apt-get -y install clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev \
         build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev \
         libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev \
         qt5-default qt5-qmake curl \
    && git clone https://github.com/gcp/leela-zero \
    && cd leela-zero && git submodule update --init --recursive \
    && mkdir build && cd build && cmake .. && cmake --build . \
    && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . \
    && sudo apt-get -y install glances zip && sudo apt-get clean \
    && sudo reboot
else
  # Note: 'sudo -i && cd ...' would open an interactive root shell and only
  # run the rest after it exits; run the commands through sudo directly.
  sudo sh -c 'cd /leela-zero/autogtp && ./autogtp -g 2'
fi
chrispsn commented 5 years ago

Thanks for writing your guide and doing the efficiency tests! They're really good instructions; I've been using them for around a week.

Agree that:

I also agree it's important to have the latest software, but if we want this to be set-and-forget, would it also be good for the installation instructions to be updated automatically? If so, one way to achieve that is for your guide to point to an image that's refreshed every 24 hours, particularly given the instances are likely to be destroyed within 24 hours anyway. (Another way could be to host the latest script in a GitHub gist, download it to the instance, and run it. I should also flag that I have no idea how hard it is to create images for Google's service; I'm only speaking from experience with Docker.)
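To sketch the gist option (the URL below is a placeholder, not a real gist), a cron entry could re-fetch the script at every boot, so updated instructions take effect without rebuilding any image:

```
# /etc/cron.d/leela-bootstrap (sketch; the URL is a placeholder)
# Re-download the latest setup script at every boot and run it.
@reboot root curl -sSL https://example.com/latest-setup.sh | bash
```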

On your responses to my questions:

  1. Let me put the question another way: assuming both are using the latest files, how big is the speed disadvantage of using pre-compiled binaries or an image, as opposed to compiling from scratch?

  2. Agree, I had in mind cases where the instance gets stopped after less than an hour (more applicable to other cloud services).

I don't think security is an issue because:

wonderingabout commented 5 years ago

this is going to take some time for me to answer

nathanloop commented 5 years ago

@wonderingabout That script didn't work for me without some modification:

"sudo add-apt-repository -y ppa:graphics-drivers/ppa"

This doesn't work in the vanilla Ubuntu 18.04 install without first running "sudo apt-get install software-properties-common"

wonderingabout commented 5 years ago

@nathanloop surprising, i tried it last week (at the release of leela zero v16/autogtp v17) and it was working on google cloud

i'll try it another time (atm busy with azure cloud) and let you know if it works

ghost commented 5 years ago
  • However, while Windows binaries are distributed, Linux binaries aren't: we're told to compile from source. Even the 'official' Dockerfiles contain compilation steps. Why?

It is a long tradition with *nix systems, based on practical engineering, trust, and incompatibility between different distributions and versions. I might not have the same Ubuntu version as you, I might have a different desktop on my Ubuntu, or no desktop at all, or I might be using a non-Ubuntu Linux, and I might not trust a random binary.

Usually, users will only trust the binaries sent out by the distribution itself in the package manager, and will build everything else from source.

chrispsn commented 5 years ago

Thanks. Is there a difference in processing (game generation) speed between compiling afresh each time and using a "one-size-fits-all" binary?

ghost commented 5 years ago

Theoretically, building from source will give you a build optimised for your system. In practice, this may or may not happen.

As an example of when theory and practice do conform: the standard build of the Python interpreter is much slower than the one that ships with Ubuntu, due to Ubuntu-specific optimisations.
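For the fleet use case in this thread, one way to get a cached, ready-to-run build anyway is to bake the compile into a Docker image. A sketch, not an official image — the base image tag is an assumption, and if the whole fleet uses one instance type, a single compile still targets the right CPU:

```dockerfile
# Sketch (unofficial): compile once into an image so spot instances only
# pull cached layers instead of rebuilding from source each time.
FROM nvidia/opencl:devel-ubuntu18.04
RUN apt-get update && apt-get -y install \
      cmake git build-essential libboost-all-dev libopenblas-dev zlib1g-dev \
      qtbase5-dev opencl-headers ocl-icd-opencl-dev
RUN git clone https://github.com/gcp/leela-zero \
 && cd leela-zero \
 && git submodule update --init --recursive \
 && mkdir build && cd build \
 && cmake .. && cmake --build . \
 && cp autogtp/autogtp ../autogtp/ && cp leelaz ../autogtp/
WORKDIR /leela-zero/autogtp
CMD ["./autogtp", "-g", "2"]
```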

nathanloop commented 5 years ago

@wonderingabout my mistake. I was on 18.04 minimal; with normal 18.04 it works fine.