glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess ** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Dedicated training hardware #194

Closed · Error323 closed 6 years ago

Error323 commented 6 years ago

Currently my system [1] is performing the training, which is happening on a 24/7 basis. Which is awesome! I'm very happy to do it, I love doing it in fact, but there is the problem of the machine also being used as my workstation. Although the V2 chunks help tremendously with neural net throughput, it's still very straining on my system: 70% total CPU usage, 20 GiB RAM usage. But more importantly, it just can't be dedicated for the long haul like this, as I'm developing, messing up memory usage, and rebooting over and over (you know how it goes).

I think it would be of great benefit to everyone if we could get some dedicated solution. Unfortunately, cloud-based solutions that include a GPU are very expensive, and we certainly need a GPU for this. We discussed this somewhat in the chatroom, and one solution we thought of would be to buy some dedicated hardware that I (or someone else) could run 24/7. I'm willing to donate a GPU (1080 Ti), a case, and the electricity costs (though the latter may get a bit scary). So we'd need memory (the more the better), a motherboard, a CPU (at least 4 cores), and a hard disk: roughly 800 euros or so?

Maybe there are better solutions that you can think of. I'd also like to note that this isn't needed right now, but rather within the next two weeks to a month. Let's discuss...

[1] Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz, 32G RAM, 2x 1080Ti

ghost commented 6 years ago

I suggest starting a Patreon to collect money for a computer and to cover the ongoing electricity cost. They take a 5% cut plus about 5% for payment processing. I'd be willing to donate.

Uriopass commented 6 years ago

Also, as a quick mockup, this config would be (I think) ideal, and it comes to $948 (€765); please correct me if you see adjustments:

[image: build mockup](https://user-images.githubusercontent.com/5420739/37973744-7c44f750-31db-11e8-943b-9b2b15719a8d.png)

I'd also be willing to donate. The only other solution would be to find someone else who can give @Error323 full access to a good CPU+GPU machine and is willing to let it run 24/7. You never know, maybe someone has very good hardware?

jjoshua2 commented 6 years ago

I have an unused system: an i5 3570K at 4.2 GHz with 16 GB DDR3-2133 RAM and an AMD RX 580. How does this scale with cores? Ryzen 8-core chips are much better value than Intel if it can actually use the cores, but if not, those old quad-core Intels clock really high. Maybe with a 1080 Ti it would be good enough? Or would it also need more RAM?


pw31 commented 6 years ago

Is this an alternative? https://cloudplatform.googleblog.com/2018/02/Cloud-TPU-machine-learning-accelerators-now-available-in-beta.html. We only need 40 hours, right? There is a "start your free trial" ...

Uriopass commented 6 years ago

We need 40 hours... With 5000 TPUs.

brianprichardson commented 6 years ago

One of these would be helpful...

http://www.theregister.co.uk/2018/03/27/pure_nvidia_ai_airi

pw31 commented 6 years ago

DeepMind used 5,000 first-generation TPUs for 4 hours = 20,000 TPU-hours. Google sells the new Cloud TPU time at $6.50 per TPU-hour. That would be $130,000.
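
As a quick back-of-the-envelope check of those figures (just the numbers quoted above, nothing more):

```python
# Back-of-the-envelope cost check using the figures quoted above.
tpus = 5000    # first-generation TPUs used by DeepMind
hours = 4      # training time
rate = 6.50    # USD per Cloud TPU-hour (quoted beta pricing)

tpu_hours = tpus * hours       # 20,000 TPU-hours
cost = tpu_hours * rate        # 130,000 USD
print(f"{tpu_hours:,} TPU-hours -> ${cost:,.0f}")
```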

nousian commented 6 years ago

I'd be willing to donate a proper server room with HVAC, access control, etc., remote access to the machine over the internet, and all the power needed, for as long as it takes. And there is spare room for multiple machines if needed.

I have a few spare i5 machines at 3.x GHz with 8/16 GB RAM and 1050/1060 GPUs; with an upgrade to a 1080 Ti, maybe one of these would be good enough?

Error323 commented 6 years ago

Hi @nousian,

That sounds awesome! One of those would be good enough for sure; more would allow parallel experimentation for potentially faster convergence. The bigger the main memory, the bigger our shuffle buffer, which helps keep the training batches well randomized.
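
(For context, a minimal sketch of the streaming shuffle-buffer idea; the class and method names are illustrative, not the project's actual training code. Each incoming sample swaps with a random resident of a fixed-size buffer, so a larger buffer draws each emitted sample from a larger, better-mixed pool:)

```python
import random

class ShuffleBuffer:
    """Streaming shuffle buffer: holds `capacity` samples and emits a
    randomly chosen resident for each new sample that arrives, which
    decorrelates the output order from the (highly sequential) input."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = []

    def insert_or_replace(self, sample):
        # Fill the buffer first; once full, swap the new sample with a
        # random resident and emit the evicted one for training.
        if len(self.buf) < self.capacity:
            self.buf.append(sample)
            return None
        i = random.randrange(self.capacity)
        evicted, self.buf[i] = self.buf[i], sample
        return evicted
```

The buffer lives entirely in RAM, which is why more main memory translates directly into better-mixed training batches.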

How would you like to proceed?

nousian commented 6 years ago

Let me check the precise details on those spare machines today - they are at my office. If they need more RAM, I can buy some for sure, within reason. How much would be "enough"?

I think the key ingredient is the GPU; none are available anywhere here (the fastest in stock is the 1050 Ti / 1060 that I already have). You have a 1080 Ti that could be used for this?

And what OS is needed for this? Those are semi-retired office PCs, so they have Windows 10 on them, but it's easy to install Debian or Fedora. Would SSH remote access be enough?

jkiliani commented 6 years ago

What is the actual speed difference in training between a 1060 and a 1080 Ti? If training is currently still CPU bound (is it?), the 1080 Ti may not even make a difference, although that would of course change once larger nets are used.

nousian commented 6 years ago

Sorry, I got the model wrong (there is no 1060 Ti). I have:

GTX 1050 Ti 4 GB - 1981 Gflops
GTX 1060 6 GB - 3855 Gflops

There is quite a jump:

GTX 1070 - 5783 Gflops
GTX 1070 Ti - 7816 Gflops
GTX 1080 - 8228 Gflops
GTX 1080 Ti - 10609 Gflops
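
(Normalizing those numbers against the GTX 1060, under the rough assumption that training throughput scales with peak Gflops; real speedups also depend on memory bandwidth and batch size:)

```python
# Rough relative-throughput comparison from the Gflops figures above,
# assuming training speed scales linearly with peak FP32 Gflops
# (a simplification; memory bandwidth and batch size also matter).
gflops = {
    "GTX 1050 Ti": 1981,
    "GTX 1060":    3855,
    "GTX 1070":    5783,
    "GTX 1070 Ti": 7816,
    "GTX 1080":    8228,
    "GTX 1080 Ti": 10609,
}
base = gflops["GTX 1060"]
for card, g in gflops.items():
    print(f"{card:12s} {g / base:4.2f}x vs GTX 1060")
```

By this crude measure, a 1080 Ti would be roughly 2.75x a 1060.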

Error323 commented 6 years ago

Hi @nousian,

Let me enumerate some suggestions which I think would work best here:

Maybe it would be best if we discuss further details offline. You can reach me through my github email account.

nousian commented 6 years ago

Ok @Error323 I will contact you later today.

Error323 commented 6 years ago

I'm happy to report that this has been resolved brilliantly by @nousian. He donated 3 (!), yeah that's right, three machines for us to experiment and train on.

lc0.train.1

lc0.train.2

lc0.train.3

This means we can experiment in parallel with various nets and/or improve our diagnostic tools. Thank you @nousian :trophy: :fireworks: :1st_place_medal: