Bigger, stronger, not faster (4dc12a8e etc)

gcp commented 6 years ago

There are a number of tests with bigger networks running, notably 128x20 and 192x10 configurations. Probably due to issue #1025, the 128x20 networks failed rather badly, but 192x10 has been doing OK (it was not affected by the bug due to fortunate circumstances).

If the new networks are noticeably stronger than the best 128x10, it makes sense to switch. It looks like we may already be very close to this point.

Exactly where the boundary lies is not so clear: if 128x10 has stalled (it has not, so far), then it's clear. If the new net is several hundred Elo stronger, then it's clear. In the middle there's some trade-off between having 1/2 the games played at a higher level, versus being able to iterate new networks faster at a smaller size. I am not worrying too much about this zone, as inevitably we will hit one of the "clear" cases sooner rather than later.

I would also prefer to be able to retrain the 128x20 so it can be tested against 192x10, but I don't consider it a blocker for moving up (if 128x20 ends up surpassing 192x10 significantly, we'll just let it promote...).

(I originally planned for 128x10 to be the "final size" for this "first run", but the progress made with bootstrap/net2net, improved client speed and the current strength are certainly making me want to keep going at it for a bit longer...)

bochen2027 commented 6 years ago

We already know Google DeepMind can train a pure 20 or 40 block network from scratch. Wouldn't it be even more interesting to see just how far this project can go by starting out with a small network and expanding to ever larger networks as it outgrows them, and then compare against what we know of a perfect scenario (AGZ) to see what the limits of this process are? To that end, why stop at 20 or even 10? Why not eventually go to 40 even for this so-called "first run"? Who knows, it might maybe even turn out in the end to be the only run that is ever needed.

BTW, if/when you do decide to start over completely for a clean second run, will you keep the first run alive as long as it is making progress and let folks choose which run they want to contribute cycles to, or will you wait until the first run has flatlined before starting the second run?

My guess is part of the reason for starting at 4 or five blocks and moving up to 6, then 10 then 20 is that you wanted to test the waters and a proof of concept and to get community traction first, as most would be put off by the daunting element of starting with a 40 block at the very beginning: there wouldn't be enough progress to garner interest, which means fewer people contributing cycles. It's a catch-22, and the project just dies off and never gets off the ground. So in my mind the process is just as important as the end result. If you can demonstrate that it is possible to start out small and grow your way into bigger and better, that would show something novel that the likes of DeepMind/Tencent/etc. never needed to show and haven't shown. Not every project or community has the resources of a big player like Google, and as crowd-distributed, Kickstarter-like projects proliferate it's ever more important to find efficient ways of increasing performance and delivering results. So I feel the value of seeing something like this through to 20 and 40 blocks goes beyond just the case of Go, and may have reverberating effects throughout much of the rest of the deep-learning community.

jkiliani commented 6 years ago

As for this being the only run that's ever needed, while I'd certainly be in favour, this also depends on the hardware configuration of the training server: Would it still be practical for a single machine to train a neural network of Deepmind's 256x20 or even 256x40 dimensions? I can't answer if there are practical roadblocks to simply continuing until we reach AGZ, nice as this would be...

Isn't there some problem with your (GPU) RAM and the mini-batch size at some point? Sorry if I got that wrong...

gcp commented 6 years ago

The speed of the training (machine) is not related to the size of the network, only to the speed * number of the clients. I think I've explained a few times why that is: if the training speed halves, the rate of game production will also already have halved because the clients have taken the same speed hit.

And note the current setup "wastes" a lot of training effort by restarting from the best network. There are enough ways out of this that I don't really see this as a fatal roadblock.

Dorus commented 6 years ago

If we want to do a fresh run, it would make most sense to do it with a 6x128 net. The game generation speed would be huge and we could get it maxed in relatively little time. If a fresh 6x128 net can surpass our last 6x128 net by a lot, we know we should start fresh with 20x256. If it's comparable, we can still use that cleaner 6x128 net to net2net to 10x128 -> 10x192 -> 20x192 -> 20x256 -> 40x256 etc.

I do not think it's smart to do a fresh run with a 20x256 net or anything huge like that; we really need to confirm first that starting with a small net is damaging for final strength.

I would still like a single fresh run just to see how all our changes affect network progression at lower Elo levels. Even just for that a new run would be nice, and it shouldn't take more than 2-4 weeks to run. If my estimate is correct, we should be able to pump out 1.4M games / week at 6x128 right now (well, probably half that the first week, as we have to keep -r low then and -v will be very spread out, so reuse is low), but 2-4 weeks should be enough for 1.5-4.5M games and enough to get a net near max.

Edit: other things on the wishlist (for a fresh run or not): AZ always promote. AZ t=1 all game.

gcp commented 6 years ago

> Why not eventually go to 40 even for this so-called "first run"? Who knows, it might maybe even turn out in the end to be the only run that is ever needed.

If it turns out to be practical, sure. But I am wary of the total time required, as well as the last "phase" where only slow and very incremental progress can be expected. If people (clients) start dropping off, things will be over soon. That's why I have said in the past that I believe a "full" run might need more work on the community side of things.

> My guess is part of the reason for starting at 4 or five blocks and moving up to 6, then 10 then 20 is that you wanted to test the waters and a proof of concept and to get community traction first, as most would be put off by the daunting element of starting with a 40 block at the very beginning

I myself didn't consider 40 blocks doable at the very beginning (and frankly, I'm still not so sure). The scaling up was a very practical consideration: if the network is random, there is no point in using 256x40 to play random moves. So we can scale the network up as the amount of knowledge there is for it to accumulate grows. The last transition did point out the pitfall: it's possible for the progress to be slowed because the network is too small, which is why I'm increasing size faster now.

More or less agree with the rest of your post.

gcp commented 6 years ago

> If we want to do a fresh run, it would make most sense to do it with a 6x128 net. The game generation speed would be huge and we could get it maxed in relatively little time. If a fresh 6x128 net can surpass our last 6x128 net by a lot, we know we should start fresh with 20x256.

Fair point!

It's a real open question if a fresh run would reach an eventual higher end state than our current procedure, and it would be good to find out before attempting 256x40.

But it's unappealing to try when we're pushing the public state of the art. It'll be more appealing when we're stuck, for whatever reason.

gcp commented 6 years ago

> Isn't there some problem with your (GPU) RAM and the mini-batch size at some point? Sorry if I got that wrong...

As long as a batch size of 1 can fit there's no problem, and it fits a batch size of 32 for 256x40. This is, perhaps not so coincidentally, the same effective per-GPU batch size that Deep Mind used for some of their efforts (32 per GPU times 64 GPU = 2048 positions per batch, which is where that size in the original paper comes from).
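The arithmetic in the parenthesis can be written out as a small sketch. The function and variable names are illustrative only and are not taken from the Leela Zero training code.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    """Positions consumed per optimizer step when every GPU
    processes its own mini-batch in parallel."""
    return per_gpu_batch * num_gpus


# Figures quoted in the comment above: 32 positions per GPU on 64 GPUs.
total = effective_batch_size(per_gpu_batch=32, num_gpus=64)
print(total)
```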

jkiliani commented 6 years ago

I wouldn't be surprised if a lot of the computer Go community, and the Go associations started investing a considerable amount of compute in Leela Zero if that hasn't happened already. For computer Go developers, LZ self-play games and training data will soon be the best freely available training source for their own models, much better than professional games, so they may end up running the client to get access to the data. Go professionals will soon be interested in LZ as a playing partner and analysis tool, events like the match against Haylee tomorrow would help on that end.

Since pushing Leela Zero to higher levels will be very useful for many people with different motivations, I think community support will likely take care of itself now. By the way, does @roy7 have any insight where the current surge in self-play games is coming from? Surely that's not just @alreadydone with BR2?

gcp commented 6 years ago

Isn't it simply because v=3200?

When we made the switch, it was pointed out that because the policy network has gotten more accurate, something like v=5000 would still maintain speed parity, and v=3200 would end up being a big speedup.

That prediction has turned out to be accurate.

bochen2027 commented 6 years ago

Correct about the "if people (clients) start dropping off" part. Even for an open source project, since it is computationally backed by the distributed public, it's the interest of the majority that has to be kept up to make sure the project stays alive. So I'm guessing whatever direction you take has to keep that in mind. Whatever path you end up taking, I think in the short term quickly ramping up to larger sizes, as soon as it's practical, is the best course, as it may give a sizable bump: LZ has just recently surpassed your own Leela (non-zero) and is approaching (and perhaps soon surpassing) other top open source bots like AQ. So the best approach seems to be to squeeze all the juice you can out of this pass, in case you later decide to start a second run. That way, even if interest in a second run wanes or it somehow fails, at least you have locked in something great with the first run.

Friday9i commented 6 years ago

"But it's unappealing to try when we're pushing the public state of the art. It'll be more appealing when we're stuck, for whatever reason." : clear. An idea: use only 75% of the clients' time for pushing the best network further, and dedicate the remaining 25% of the time to restarting a 6x128 network from scratch ;-). That would keep collective incentive high while freeing up computing resources for "less appealing" tests!

bochen2027 commented 6 years ago

@Friday9i I like your idea of letting the people choose which meta networks they wish to promote lol!

herazul commented 6 years ago

EDIT - Two people already made almost the same point by the time I finished writing this. Sorry!

I think that going for at least a 20x192 or 20x256 would be very good for the project before starting from scratch: with the progress that 10 layers gave us, the next two network size improvements will probably give us a network that ranks a lot higher than every bot that exists now (AQ, Crazy Stone, etc.) and stands out in the pro-level field. I think it would give A LOT of publicity to Leela Zero, and maybe bring in a lot of contributions, which would be beneficial.

odeint commented 6 years ago

Amazing progress overall! The community of this project is doing an awesome job (in particular @gcp of course, who gets an impressive percentage of decisions right). Just wanted to thank you all, it's a delight to watch and new improvements never fail to brighten my day!

afalturki commented 6 years ago

In addition to the -v and -r changes, a reason for the sudden jump in games (pure speculation) might also be that it's spring break right now in the US. So, BR2 is not under the usual load and @alreadydone can request more workers than usual which would make the current rate last for a week only. Hopefully, and most likely this is not true.

Moreover, I'm sure a lot of contributors are either students or faculty in universities and, hopefully, some of them might be able to convince the university supercomputer admins to contribute some idle processing power.

bochen2027 commented 6 years ago

At any given point in time the total number of clients seems to hover above the 200 range. Do we know what specs most of these clients have? Are they like a GTX 970 on average, or lower or higher?

I had thought of the idea of someone creating a Linux AMI for LZ training and sharing it, so people could just sign up for Amazon's free tier (a micro instance for a year) and use the CPU-only version. But alas, it throttles to only 10% when used consistently. Since a CPU is already 10 times slower than an average GPU, you would need 20000 people running AWS "free tier" instances to match the current volume from the ~200 (assuming all GPU) clients.
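The 20000 figure follows from multiplying the two slowdowns together. A rough sketch of that estimate, using only the numbers from the paragraph above (the function name is made up for illustration):

```python
def cpu_clients_needed(gpu_clients: int, cpu_slowdown: float,
                       throttle_fraction: float) -> int:
    """Number of throttled CPU clients matching `gpu_clients` GPU clients.

    cpu_slowdown: how many times slower one CPU is than an average GPU.
    throttle_fraction: share of CPU time actually available (0.10 = 10%).
    """
    # Throughput of one throttled CPU client, measured in "GPU clients".
    per_client = throttle_fraction / cpu_slowdown
    return round(gpu_clients / per_client)


needed = cpu_clients_needed(gpu_clients=200, cpu_slowdown=10,
                            throttle_fraction=0.10)
print(needed)
```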

@afalturki, a lot of folks are probably donating extra cycles in anticipation of Haylee's match this Thursday. Hopefully we get a few more promoted networks.

billyswong commented 6 years ago

@Dorus If we are to do a fresh run, wouldn't it be more fun to start with 10 blocks, maybe 10x64 or even 10x32 if we want speed? The 6x128 nets that we obtained are blind to large group life/death and also to ladders. A fresh restart initially on 6x128 will repeat that. On the other hand, a low-filter 10 block net will probably handle those okay, which will work as a more balanced AI for kyu players :smile:

ssj-gz commented 6 years ago

Am I being blind/ dim, or is 1934 not showing up on the graph?

Edit: Ah - it overlaps almost exactly with 4dc12 - when I posted, I couldn't even find it by zooming in on elotest XD

jkiliani commented 6 years ago

@billyswong The main point of the fresh run (if it happens) is to test whether it would end up at the same strength for 6x128 as the current one, so we could finally answer the question "Does bootstrapping affect the long-term strength of a network during reinforcement learning?"

If we did somehow end up at a considerably stronger level (which I personally doubt), then it would make sense to start a 20x256 run or something like that from scratch. But if the result is same or similar strength as b3b00c6d, the last 6x128 network, we would have proven that bootstrapping works with no ill effects, and could then continue expanding the net from the first run.

afalturki commented 6 years ago

Instead of doing a fresh run, why not try to find ways to break the plateau for a certain size by starting from the last best network of that size and producing new games while experimenting with game generation and training parameters? For example, increasing randomness to try to fill blind spots arising from bootstrapping from smaller networks, or doing always-promote, which might help the networks get out of local maxima. It might also be a good idea to train the best 6x128 network (for example) on the later games of larger network sizes. Such experiments might be more interesting and more efficient than starting from scratch.

A good example of network parameters having a great effect on the plateau is what happened at 3.1 million games when the training parameters were fixed.

roy7 commented 6 years ago

If we ever did a fresh run, and if we're willing to break new ground away from Deepmind's paper, I'd still love for some sort of rotational NN to be used so our rotation hassles go away completely. https://github.com/microljy/DREN is one such solution, and has a link to tensorflow code. There are various other papers with other possibly faster approaches too, but DREN has the advantage of existing code to possibly just add and reuse. ;)

I don't think we can easily adapt to a new architecture mid-run, but a restart would allow for dramatic changes like this.

gcp commented 6 years ago

> Why not instead of doing a fresh run, we could try to find ways to break the plateau for a certain size by starting from the last best network in that size and start producing new games while experimenting with game generation and training parameters

I think the point is that if you end up at the same strength, you definitely know starting fresh for larger sizes is a no-go!

jkiliani commented 6 years ago

@roy7: I would really prefer that a rotational NN just reduce the symmetry imbalances but not eliminate them completely: the random rotation picked for each node expansion makes the code non-deterministic (for different random seeds) even without Dirichlet noise and temperature, which is very useful. Leela Chess currently does not have any symmetry implemented, and as a result has to resort to Dirichlet noise in evaluation matches just so that the matches don't repeat the same game.

If cyclic symmetry reduces the rotations from 8 to 2, that would be fine, especially if it came with a strength boost.

RavnaBergsndot commented 6 years ago

In the AlphaGo Zero paper, they took only the last 2 million games and trained a ~4300 Elo supervised network. This probably implies that most worries about forgetting/bootstrap/initial conditions are non-issues. The AGZ pipeline is strongly regulated by MCTS, which has the property of not letting bias get in the way most of the time. The bias in move selection is furthermore addressed by the Dirichlet noise. My opinion is: if it's not broken, don't fix it.

roy7 commented 6 years ago

@jkiliani The existing rotational NNs would reduce us from 8 to 2. For transpositions we'd need to add something new to the existing approaches. Still, learning 2 representations of the same board is much better than learning 8.

For randomness in games, I was thinking that similar to my lcb/ucb related experiments, a move could be taken randomly weighted in some way by the relative confidences. So two nearly identical moves might be 50/50, but a move where there's less overlap between best move and 2nd best move will result in a choice weighted towards the best move.
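One possible way to turn "weighted by the relative confidences" into code is sketched below. This is not the lcb/ucb experiment code: it assumes, purely for illustration, that each candidate move comes with a mean winrate and a standard deviation, models each as a normal distribution, and picks a move in proportion to how often it comes out best under sampling. The `pick_move` name and its inputs are hypothetical.

```python
import random


def pick_move(candidates, samples=2000, rng=None):
    """Pick a move with probability roughly equal to its chance of being best.

    candidates: dict mapping move -> (mean_winrate, std_dev).
                These inputs are an assumption of this sketch.
    Returns (chosen_move, estimated_probability_of_being_best_per_move).
    """
    rng = rng or random.Random()
    best_counts = {move: 0 for move in candidates}
    for _ in range(samples):
        # Draw one plausible winrate per move and record which move wins.
        draws = {m: rng.gauss(mu, sd) for m, (mu, sd) in candidates.items()}
        best_counts[max(draws, key=draws.get)] += 1
    moves = list(best_counts)
    weights = [best_counts[m] for m in moves]
    chosen = rng.choices(moves, weights=weights, k=1)[0]
    return chosen, {m: best_counts[m] / samples for m in moves}


rng = random.Random(1)
# Heavily overlapping estimates: close to a coin flip.
_, p_close = pick_move({"A": (0.55, 0.02), "B": (0.55, 0.02)}, rng=rng)
# Little overlap: the better move is chosen almost always.
_, p_far = pick_move({"A": (0.60, 0.01), "B": (0.50, 0.01)}, rng=rng)
```

With heavily overlapping estimates the two moves are chosen about equally often, and with well separated estimates the better move dominates, which matches the behaviour described above.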

luigio commented 6 years ago

A simple idea for randomness in games: for every move, choose randomly between running n playouts and running n+m playouts, with a small m.
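A minimal sketch of this idea, assuming an equal chance for both options; the function name and the example values of n and m are made up for illustration:

```python
import random


def playouts_for_move(n: int, m: int, rng=random) -> int:
    """Return either n or n + m playouts, each with probability 1/2."""
    return n + m if rng.random() < 0.5 else n


rng = random.Random(0)
# Playout budgets for the first ten moves of a hypothetical game.
budgets = [playouts_for_move(1600, 50, rng) for _ in range(10)]
print(budgets)
```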

billyswong commented 6 years ago

@jkiliani 6x128 bootstrapping was not done by net2net. We are using net2net now so a fair answer for "Does bootstrapping affect the long-term strength of a network during reinforcement learning?" could only be done by doing 10x128 directly.

jkiliani commented 6 years ago

No, I think this is comparable. Sure, the 6x128 bootstrap was done without net2net, but since we do treat net2net as a smarter random initialisation, it is the same thing really.

pangafu commented 6 years ago

@gcp congratulations! Could we update https://sjeng.org/zero/ again? The number of sgf games from the 10b network has reached 500,000, so it's time to train a stronger, bigger network.

Dorus commented 6 years ago

@pangafu Sorry, you made no sense. An sgf is a file format to store go games. I'm also not sure what 50W means, but we are always training stronger networks; the training pipeline is automated. We also have match games running right now with various larger networks. So far one failed and one is still ongoing, but since its current winrate is only 56%, it's only marginally stronger. This is not enough to compensate for the fact that a larger network will be slower. A slower network can do fewer playouts, and fewer playouts means it's weaker again.
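To put the 56% winrate in perspective, it can be converted to an approximate Elo gap with the standard logistic Elo formula. A minimal sketch; the function name is illustrative, and the result ignores the sampling error of an unfinished match:

```python
import math


def elo_diff(winrate: float) -> float:
    """Elo difference implied by an expected score under the logistic Elo model."""
    return 400 * math.log10(winrate / (1 - winrate))


gap = elo_diff(0.56)
print(f"{gap:.1f} Elo")  # roughly 42 Elo
```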

pangafu commented 6 years ago

@Dorus yes. I used the v1 program to train a 20x128 network (it can be downloaded from the link below) on the Leela Zero sgf archive from 2018-03-07. In my test it scores about 50% against 4dc12a8e (10x192), so I think the V2 pipeline training program may have some bugs, and 20x128 may have the same strength as 10x192.

I used the newest 50W games to make the network stronger, because I think the 10b sgf games are much stronger than the 6b ones. After that, I will add some human sgf games to the training.

You can also download my trained network at https://pan.baidu.com/s/1kFES0rrFCVh7h-b2XzMaWw#list/path=%2FLeelaZero%2F20-128_%E7%AC%AC%E4%BA%8C%E8%BD%AE%E6%9D%83%E9%87%8D&parentPath=%2FLeelaZero

Dorus commented 6 years ago

@pangafu Very interesting result on the 20x128 net vs the 10x192 one.

I'm still not sure what 50W means.

Note that adding human sgf files to make a stronger net is possible, but it will not be done on the real project as it goes against the zero approach.

Also thanks for the download link :)

pangafu commented 6 years ago

@Dorus I just did as @gcp said: use a 500,000 sgf window to init the n2n network, then use 250,000 sgf to make the network more accurate. But the sgf archive from 2018-03-07 only contained about 60,000 10b games.

Sorry, in Chinese, 1W = 10,000

I am also just running some tests on training a human network based on zero, to see what will happen... Maybe it will train a network that plays moves like AlphaGo Master...

jkiliani commented 6 years ago

Before we switch to a larger net, I think #1036 should be tracked down and resolved.

Dorus commented 6 years ago

Something is really wrong with how these larger nets are trained :S

I hope this is just a bug in the training pipeline. We've never seen anything like that in previous net2net attempts.

jkiliani commented 6 years ago

I don't think so... I think this is client related, or it would show up in all matches consistently.

Dorus commented 6 years ago

But if it's client related I would expect it to also show up for the current 10b network. However, all the games you linked have the wrong passes only on the new network.

jkiliani commented 6 years ago

I already explained my hypothesis, meaning that both bootstrap sizes cause an overflow of some sort for the enlarged network size (and for very select clients), but not yet for 10x128. I may be wrong, but the data looks like this to me.

gcp commented 6 years ago

Not much we can do about broken clients.

gcp commented 6 years ago

> use 500,000 sgf window to init n2n network,

I don't use the SGF at all as they don't have the move probabilities from the search. You want this: https://github.com/gcp/leela-zero/issues/167

pangafu commented 6 years ago

@gcp so if I want the last self-play data, just download the last file? http://leela.online-go.com/training/train_545ca6d6.zip

alreadydone commented 6 years ago

@jkiliani @afalturki Yes, it's indeed spring break here, and I feel that I am able to use more GPUs more often. At the time there were 4200+ games in the past hour I was using 384 GPUs (the maximum number I'm ever able to use, also achieved before), generating about 2600 games per hour. However at this moment it has dropped down to 156 GPUs; apparently people are getting work done during the break. (I have a script to submit jobs automatically, probably so do they.) By the way there will be a scheduled monthly maintenance starting from 0:00 on March 18 (EDT) which lasts about 18 hours.

I ran full tuner for 192 filters last night to get benchmarks for the 10x192 net using genmove b (thinking at most 50.0 seconds) or netbench (1600 evaluations). ~~Compared to the results in https://github.com/gcp/leela-zero/issues/965#issuecomment-370022071, I've run full tuner for 128 filters in the meantime.~~

Tesla K20 @ 705MHz:
- 6x128: 1194 n/s, 61%, netbench 677 n/s
- 10x128: 895 n/s, 72%, netbench 458 n/s
- 10x192: 548 n/s, 84%, netbench 260 n/s
- 20x128: 556 n/s, 85%, netbench 254 n/s

Tesla K20X @ 732MHz: (6.5% of the GPUs on BR2 are the faster K20X)
- 6x128: 1240 n/s, 62%, netbench 719 n/s
- 10x128: 935 n/s, 67%, netbench 485 n/s
- 10x192: 600 n/s, 79%, netbench 287 n/s (config copied from K20 full tuning, so maybe still room for improvement)
- 20x128: 574 n/s, 79%, netbench 265 n/s
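For a quick read of what these numbers mean for a size switch, the sketch below expresses the Tesla K20 genmove figures above relative to the current 10x128 size. Only the n/s values come from the benchmark; the variable names are illustrative.

```python
# Tesla K20 genmove speeds in n/s, copied from the benchmark above.
k20_nps = {"6x128": 1194, "10x128": 895, "10x192": 548, "20x128": 556}

baseline = k20_nps["10x128"]
relative = {net: nps / baseline for net, nps in k20_nps.items()}

for net, ratio in relative.items():
    print(f"{net}: {ratio:.2f}x the 10x128 speed")
```

On this card both candidate sizes run at roughly 0.6x the speed of 10x128, so they cost about the same per game.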

jkiliani commented 6 years ago

Thanks for the heads up @alreadydone. Since it looks like the 4000+/h game rate a few days ago was the exception rather than the rule, should we take the time until 128x20 is retrained and can be properly tested against 192x10 before deciding on a bootstrap? Chances are good that 20 blocks is the bigger improvement, and for the moment there may be quite a bit of potential left in 128x10.

I'm also a bit concerned that #1036 could lead to a (probably) small percentage of self-play games corrupted after the bootstrap if the cause remains unknown.

bochen2027 commented 6 years ago

come on guys let's pump out a new network for Haylee's match tomorrow... do some quantitative easing and temporarily lower the learning rate or something lol /s

Dorus commented 6 years ago

@hydrogenpi I'm not too worried. The current net is already very strong, and good at ladders. Recent nets seem to switch between relatively good and bad at ladders; it would be embarrassing if we had a new promotion tomorrow with terrible ladder skill.

That said, the next set of matches has a good shot at getting a promotion. With our current game rate, it's hard not to promote haha.

Also, I believe the guy that will run LeelaZ is going to run some heavy hardware.

pangafu commented 6 years ago

@gcp in #167, it seems that the data is training data produced by dump_supervised, but I still want some of the newest original sgf from https://sjeng.org/zero/ to concatenate with some other human sgf, and then dump it to new training data.

bochen2027 commented 6 years ago

@Dorus, yes, 4x 1080. But he says it's a server in China and that remoting into it has lag. Dunno how much actual effective performance he can get out of it. He will publish the logs, so we will all find out tomorrow. I also noted that the p3 (V100) instances on AWS EC2 don't work as well as a dedicated physical graphics card. And actual tests have shown that using 4x V100 (basically four Titan Vs) on AWS gets only a 60+ to 90 Elo boost compared to just one 1080 card, at around 20000pp/s for AQ on 4xV100 vs 5000pp/s for the 1x1080. So AQ doesn't scale as well as people initially thought. Back to LZ, my thoughts are that even 4x 1080 won't be too much stronger than the single GTX 1080 Tis already tested on CGOS.

Ultimately I have no doubt LZ will reach a level far beyond average professionals, but part of it is about publicity. Even if DeepMind had tested each of the three versions of its AG against Fan, Lee and Ke a bit too early, the end result would be the same: eventually AGZ would surpass all humans. But from the PR standpoint it would have been a complete 180 if AG had lost rather than won its matches with the three pros (because obviously, on its way up, it was once weaker than all of them). So from a PR perspective timing is important; it's important to win.

After all, Haylee, though a retired pro, is active in the community, and a "win" could gain LZ more public goodwill and exposure and take it to the next level in terms of community involvement. Every bit helps, and it certainly would be much preferable to losing. Once word is out that it's the first public open source 'zero' bot to beat pros, and other pros start playing it, that would be a sort of top-down advertising that could usher in the kind of community support that could ultimately enable a 40 block.
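As a back-of-the-envelope reading of those AQ numbers, the quoted Elo boost can be expressed per doubling of playout speed. This sketch uses only the figures from the paragraph above; the variable names are made up.

```python
import math

aq_4x_v100_pps = 20000    # playouts per second on 4x V100, as quoted
aq_1x_1080_pps = 5000     # playouts per second on a single 1080, as quoted
elo_gain_range = (60, 90)  # reported Elo boost, low and high end

# A 4x speedup corresponds to two doublings of playout speed.
doublings = math.log2(aq_4x_v100_pps / aq_1x_1080_pps)
elo_per_doubling = tuple(gain / doublings for gain in elo_gain_range)

print(doublings, elo_per_doubling)
```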

fell111 commented 6 years ago

wow, the 20x128 has the same performance as 10x192. That makes it easier for us to pick the next network size to promote. We only need to arrange a match between the best 10x192 and the best 20x128.

AAPMTG306 commented 6 years ago

As far as I know, LZ has already beaten professional players in China. I think it has a very good chance of beating Haylee if it doesn't run into those ladders or self-atari.

bochen2027 commented 6 years ago

@pheasant75 wow, this is the first I've heard of LZ beating a pro player. Do you have a link to the game? How strong a pro is he/she? Like, what Goratings ranking? And do you know which recent LZ network was used, and whether it was running on fast, high-spec hardware or an average GTX? It is interesting that LZ is already beating pro players so soon. Last month it was still way behind AQ, and there were doubts about whether or not it was stronger than Leela11, and that 256*20 network had just come out, the one the dude spent 1 month of his GPU time training... and now LZ is stronger than all of the above lol

Thanks!

leela-zero / leela-zero

Bigger, stronger, not faster (4dc12a8e etc) #1030