leela-zero / leela-zero

Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper.
GNU General Public License v3.0

Scaling up # of people training #1979

Open chrispsn opened 5 years ago

chrispsn commented 5 years ago

Hi, apologies if this is not the right place to ask (came here via the Discord chat):

I'm hoping to put together something that will scale up by 10x+ the number of games generated for training this AI.

One promising angle is using the 'bottom of the barrel' of spot cloud prices. New instances can be stopped less than an hour after creation, so I need to minimise their spin-up time. Two questions:

  1. Docker images can be cached after download for the next time that machine is available. However, while Windows binaries are distributed, Linux binaries aren't: we're told to compile from source. Even the 'official' Dockerfiles contain compilation steps. Why? Is it feasible to use a Docker image based on Ubuntu that has everything pre-compiled? If it's not as efficient, maybe the number of machines training will outweigh that.
  2. Maybe progress can be saved and continued another time or on another machine. Is there a way to save down the current game state via autogtp on a signal or similar, without using a keyboard shortcut?
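To make question 2 concrete, here's the kind of wrapper I have in mind — a sketch only, assuming (unverified) that autogtp reacts to SIGINT the same way as its keyboard shortcut:

```shell
# Sketch only: assumes autogtp saves/cleans up on SIGINT the way it does
# for the interactive keyboard shortcut; I haven't verified that.
run_until_stopped() {
  "$@" &                          # start the worker, e.g. ./autogtp -g 2
  local pid=$!
  # Cloud providers send SIGTERM shortly before stopping an instance;
  # forward it to the worker as SIGINT so it has a chance to save state.
  trap 'kill -INT "$pid"' TERM
  wait "$pid"
}
```

The instance's startup command would then be something like `run_until_stopped ./autogtp -g 2`.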

Thanks in advance.

wonderingabout commented 5 years ago

hi, as roy7 said on discord, we sort of use this issue tracker like a forum, so this is the right place

before answering your questions, the general idea i want to convey is that your cloud training should be efficient overall: you dont have to chase the last 0.0001$, but rather make it hassle-free and as cost-efficient as possible globally

on google cloud, the tesla v100 with preemptibility and autogtp -g 2 is the most cost-efficient solution (most games per dollar); i did many tests to come to that conclusion. on microsoft azure, for example, it's actually the tesla P100 that is much more cost-efficient, as i said here: https://github.com/gcp/leela-zero/issues/1905#issuecomment-433710281

then to answer you :

for 1. google cloud free trial has a GPU quota of 1, so you won't be able to run more than one GPU instance simultaneously

for 2. i don't know about that, you'd have to ask @alreadydone maybe, but it's not going to save much anyway, as a Tesla V100 takes 4.5 minutes to produce one game (8-9 minutes to produce 2 games with -g 2)
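to put numbers on that, here is a quick back-of-envelope from the timings above (using 8.5 minutes as the midpoint of 8-9):

```shell
# Games per hour implied by the timings quoted above:
# 4.5 min per game singly, 8.5 min for 2 games with -g 2.
g1=$(awk 'BEGIN { printf "%.1f", 60 / 4.5 }')
g2=$(awk 'BEGIN { printf "%.1f", 2 * 60 / 8.5 }')
echo "one game at a time: $g1 games/hour"
echo "with -g 2:          $g2 games/hour"
```

so -g 2 buys roughly 6% more games per hour on the same instance; the games-per-dollar rankings follow directly from numbers like these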

don't hesitate to correct me if you have a different opinion, these are just my thoughts

to increase the number of contributing cloud users, our best shot, i think, is to spread the cloud instructions more widely on social networks

edit: also, on page 9 of the google doc, this is the entirely automated script we're using:

https://github.com/gcp/leela-zero#using-a-cloud-provider

#!/bin/bash
# The presence of the glances package is used as a marker that first-time
# setup has already completed on this machine.
PKG_OK=$(dpkg-query -W --showformat='${Status}\n' glances | grep "install ok installed")
echo "Checking for glances: $PKG_OK"
if [ -z "$PKG_OK" ]; then
  echo "No glances. Setting up glances and all other leela-zero packages."
  sudo apt-get update && sudo apt-get -y upgrade && sudo apt-get -y dist-upgrade \
    && sudo add-apt-repository -y ppa:graphics-drivers/ppa \
    && sudo apt-get update \
    && sudo apt-get -y install nvidia-driver-410 linux-headers-generic nvidia-opencl-dev \
    && sudo apt-get -y install clinfo cmake git libboost-all-dev libopenblas-dev zlib1g-dev \
         build-essential qtbase5-dev qttools5-dev qttools5-dev-tools libboost-dev \
         libboost-program-options-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev \
         qt5-default qt5-qmake curl \
    && git clone https://github.com/gcp/leela-zero \
    && cd leela-zero && git submodule update --init --recursive \
    && mkdir build && cd build && cmake .. && cmake --build . \
    && cd ../autogtp && cp ../build/autogtp/autogtp . && cp ../build/leelaz . \
    && sudo apt-get -y install glances zip && sudo apt-get clean \
    && sudo reboot
else
  # Note: 'sudo -i && cd ...' would open an interactive root shell and only
  # run the rest after it exits; run the commands through sudo directly.
  sudo sh -c 'cd /leela-zero/autogtp && ./autogtp -g 2'
fi
chrispsn commented 5 years ago

Thanks for writing your guide and doing the efficiency tests! They're really good instructions; I've been using them for around a week.

Agree that:

I also agree it's important to have the latest software, but if we want this to be set-and-forget, would it also be good for the installation instructions to be updated automatically? If so, one way to achieve that is for your guide to point to an image that's refreshed every 24 hours, particularly given the instances are likely to be destroyed within 24 hours anyway. (Another way could be to host the latest script in a GitHub gist, download it to the instance, and run it. I should also flag that I have no idea how hard it is to create images for Google's service; I'm only speaking from experience with Docker.)
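To sketch the gist option (the URL below is a placeholder, not a real gist), a cron entry could re-fetch the script at every boot, so updated instructions take effect without rebuilding any image:

```
# /etc/cron.d/leela-bootstrap (sketch; the URL is a placeholder)
# Re-download the latest setup script at every boot and run it.
@reboot root curl -sSL https://example.com/latest-setup.sh | bash
```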

On your responses to my questions:

  1. Let me put the question another way: assuming both are using the latest files, how big is the speed disadvantage of using pre-compiled binaries or an image, as opposed to compiling from scratch?

  2. Agree, I had in mind cases where the instance gets stopped after less than an hour (more applicable to other cloud services).

I don't think security is an issue because:

wonderingabout commented 5 years ago

this is going to take some time for me to answer

nathanloop commented 5 years ago

@wonderingabout That script didn't work for me without some modification:

"sudo add-apt-repository -y ppa:graphics-drivers/ppa"

This doesn't work in the vanilla Ubuntu 18.04 install without first running "sudo apt-get install software-properties-common"

wonderingabout commented 5 years ago

@nathanloop surprising, i tried it last week (at the release of leela zero v16/autogtp v17) and it was working on google cloud

i'll try it another time (atm busy with azure cloud) and let you know if it works

ghost commented 5 years ago
  • However, while Windows binaries are distributed, Linux binaries aren't: we're told to compile from source. Even the 'official' Dockerfiles contain compilation steps. Why?

It is a long tradition with *nix systems, based on practical engineering, trust, and incompatibility between different distributions and versions. I might not have the same Ubuntu version as you, I might have a different desktop on my Ubuntu, or no desktop at all, or I might be using a non-Ubuntu Linux, and I might not trust a random binary.

Usually, users will only trust the binaries sent out by the distribution itself in the package manager, and will build everything else from source.

chrispsn commented 5 years ago

Thanks. Is there a difference in processing (game generation) speed between compiling afresh each time and using a "one-size-fits-all" binary?

ghost commented 5 years ago

Theoretically, building from source will give you a build optimised for your system. In practice, this may or may not happen.

As an example of when theory and practice do conform: the standard build of the Python interpreter is much slower than the one that ships with Ubuntu, due to Ubuntu-specific optimisations.
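For the fleet use case in this thread, one way to get a cached, ready-to-run build anyway is to bake the compile into a Docker image. A sketch, not an official image — the base image tag is an assumption, and if the whole fleet uses one instance type, a single compile still targets the right CPU:

```dockerfile
# Sketch (unofficial): compile once into an image so spot instances only
# pull cached layers instead of rebuilding from source each time.
FROM nvidia/opencl:devel-ubuntu18.04
RUN apt-get update && apt-get -y install \
      cmake git build-essential libboost-all-dev libopenblas-dev zlib1g-dev \
      qtbase5-dev opencl-headers ocl-icd-opencl-dev
RUN git clone https://github.com/gcp/leela-zero \
 && cd leela-zero \
 && git submodule update --init --recursive \
 && mkdir build && cd build \
 && cmake .. && cmake --build . \
 && cp autogtp/autogtp ../autogtp/ && cp leelaz ../autogtp/
WORKDIR /leela-zero/autogtp
CMD ["./autogtp", "-g", "2"]
```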

nathanloop commented 5 years ago

@wonderingabout my mistake. I was on 18.04 minimal; with normal 18.04 it works fine.