Version 0.10 released - Next steps

gcp commented 6 years ago

Version 0.10 is released now. If no major bugs surface in the next few days the server will start enforcing this version.

There is this 1500+ post issue where most plans for the future were posted in the past. It's become rather problematic to read, especially on mobile, and mixed with a lot of theories (most not backed with any data or experiments 😉) so I'll post my plans and thoughts for the near future in this issue.

It looks like we're slowly reaching the maximum what 64x5 is capable of. I will let this run until about 2/3 of the training window is from the same network without improvement, and then drop the learning rate. I expect that's the last time we can do that and (maybe!) see some improvement.

I have been training a 128x6 network starting from the bug-fixed data (i.e. starting around eebb910d) and gradually moving it up to present day. Once 64x5 has completely stalled, I will see if I can get the 128x6 to beat it. If that works out, we can just continue from there and effectively skip the first 6500 Elo and see how much higher we can get (and perhaps do the same with even bigger networks) from continuing the current run.

If that kind of bootstrapping turns out not to work, I'd be interested in doing a new run. My ideas for that right now:

Somewhere between 128x6 and 128x10 sized network. 128x10 would be 8 times slower, but there is a ~2x speed improvement that we could expect to have merged in by then, and the total running time would be around half a year maybe? "Short" enough that people are probably mostly going to stick around. Hopefully also "big" enough that we can see pro level play.
Immediately use new networks for self play (i.e. according to the latest AZ paper). We see very strong strength see-sawing right now. It is possible that using the new network immediately lets the learning figure out why some of those are bad and thus produce faster improvement. It's also possible that this procedure produces no or very slow improvement for our unsynchronized distributed set up and this run ends up being a total failure. But I think we should try to find this out, in the interest of answering the question in case anyone ever tries a "full" 256x20 run on BOINC or an improved version of this project.
Small revision of weights format. There is a redundant bias layer (convolution output before BN) that needs to go, and I want to add a shift layer after the BatchNorm layers. The latter hasn't been generally shown to provide improvements (and I never found any in Go either), but it is computationally almost free and makes the design more generic, so we might as well include it. (Note that scale layers are completely redundant in the AGZ architecture so no point in adding those)
There's been a demonstration that instead of stopping at 1600 playouts per move, it may be more computationally efficient to stop at 2200 "visits" per move. So we should do that.

Thanks to all who have contributed computation power and code contributions to the project so far. We've validated that the AlphaGo approach is reproducible in a distributed setting - even if only on smaller scale - and made a dan player appear out of thin air.

Some personal words:

I have been very, very happy with the quality and extent of code contributions so far. It seems that many of you have found the codebase approachable enough to make major enhancements, or use it as a base for further learning or other experiments about Go or machine learning. I could not have hoped for a more positive outcome in that regard. My initial estimate was that 10-50 people would run the client, maybe one person would submit build fixes, and that would be it. Clearly, I was off by an order of magnitude, and I'm spending much more time than foreseen on doing things like reviewing pull requests etc. So please have some patience in that regard - I will keep trying to do those thoroughly.

For the people who have a lot of ideas and like to argue: convincing, actionable data (or even better, code that can be tested for effectiveness) will make my opinion flip-flop like the best/worst politician, whereas arguing with words only is likely to be as fun and effective as slamming your head against a wall repeatedly.

Miscellaneous:

I am very interested in any ideas or contributions that make me more redundant for this project. I have some ideas of my own that I want to test. My wife would also like to see me again!

The training and server portions run fully automatically now for the most part (cough @roy7), although some other things like uploading training data have been proven problematic to automate, so that won't be live for the foreseeable future either.

There's been a lot of concern about bad actors, vandalism, broken clients, etc, but so far the learning seems to be simply robust against this. There is now some ability to start filtering bad training data, but it remains tricky to make this solid and not give too much false positives. I'd advise only worrying when there are actual problems.

RavnaBergsndot commented 6 years ago

How does the 128x6 network training work? If we keep the same training pipeline as 64x5, it needs testing self-plays. Would the self-plays be done by some contributing clients, or would you do all of them yourself?

ssj-gz commented 6 years ago

"Once 64x5 has completely stalled, I will see if I can get the 128x6 to beat it."

Does it beat it now? :)

marcocalignano commented 6 years ago

@gcp Thanks!

gcp commented 6 years ago

Would the self-plays be done by some contributing clients, or would you do all of them yourself?

I can upload the networks and schedule tests for them, same as it happens for the regular networks. The clients won't really notice, they'll just run a bit slower :-)

john45678 commented 6 years ago

@gcp thanks.

jkiliani commented 6 years ago

Thank you for running this project, it's been a delight to follow and contribute wherever possible!

About the increase in network size: Is there any good way to test in which cases increasing number of filters helps more, and where the Deepmind approach "Stack more layers" is better? In other words, is there a significant possibility that something like 64x10 might reach similar strength as 128x6? Is there any way other than training supervised nets to find out?

fishcu commented 6 years ago

GCP and everyone involved, thank you very much for all your efforts! This project has been truly fascinating to follow it, both as a Go player as well as a developer. Looking forward to further experiments.

Matuiss2 commented 6 years ago

Thks for the computer only version im generating games in 17 minutes instead of 5 hours XD, what a massive improvement!

grolich commented 6 years ago

Thank you for running and managing this wonderful project :)

evanroberts85 commented 6 years ago

Plans sound good. Just to be clear, with the 128x6. network are you moving up by the intevals of the 65x5 “best” networks using the 250k game window that those networs were trained upon? Also are you using a set number of steps? I guess this should work but other approaches are likely to work better.

Also, this will not really tell us if the difference in strength is due to the size of the networks or the differences in how they are trained.

RavnaBergsndot commented 6 years ago

Is there any good way to test in which cases increasing number of filters helps more, and where the Deepmind approach "Stack more layers" is better?

Isn't Deepmind's approach stacking both more filters and more layers? AGZ has 256 filters.

gcp commented 6 years ago

About the increase in network size: Is there any good way to test in which cases increasing number of filters helps more, and where the Deepmind approach "Stack more layers" is better? In other words, is there a significant possibility that something like 64x10 might reach similar strength as 128x6? Is there any way other than training supervised nets to find out?

I believe that in general stacking deeper is more attractive for the same (theoretical!) computational effort. You leave more opportunity to develop "higher level" features (or not, when not needed, especially in a resnet where inputs are forwarded!), or possibility for features to spread out their influence spatially. Deeper stacks are harder to train, but ResNets and BN appear to be pretty good at dealing with that.

But in terms of computational efficiency, a larger amount of filters tends to behave better, especially on big GPUs, because that part of the computation goes in parallel. The layers need to be processed serially.

"In theory" 128 filters are 4 times slower than 64 filters, but in practice, the difference is going to be much smaller.

gcp commented 6 years ago

Isn't Deepmind's approach stacking both more filters and more layers? AGZ has 256 filters.

They did 256x20 and 256x40. They did not do 384x20, for example.

gcp commented 6 years ago

Just to be clear, with the 128x6. network are you moving up by the intevals of the 65x5 “best” networks using the 250k game window that those networs were trained upon? I guess this should work but other approaches are likely to work better.

No, I started with a huge window and have been narrowing it to 250k.

barrtgt commented 6 years ago

Thanks for the fantastic work! I'm interested in knowing how results for 2200 visits bit were obtained. Also, has anyone trained a supervised network with different depths and filters?

gcp commented 6 years ago

I'm interested in knowing how results for 2200 visits bit were obtained

See the discussion in #546. There's still some work ongoing in this area, and further testing, but it looks promising. The idea is not to spend too much effort in lines that are very forced anyway.

isty2e commented 6 years ago

First I would like to thank @gcp and all contributors for this awesome project and efforts. Now we can prepare for the next run, and here are things I would like to clarify or discuss:

AGZ uses rectifier nonlinearity, while we are currently using ReLU if I am not terribly wrong. For the new network, it could be desirable to change the activation from ReLU to a nonlinear one, but unfortunately the AGZ paper lacks details about this. What would be our choice? There are many options like LeakyReLU, CReLU, or ELU.
For the training window, I still do not see any advantage we gain from including data from too weak networks. In addition to the training window by number of games, how about filtering out games based on rating also (like 300 or any reasonable value)?
I am still dubious if the AZ approach is reproducible for us, when the computational resource is fluctuating. A milder approach would be to accept-if-not-too-bad one, and to prioritize networks with more training steps. Is it reasonable enough?
For networks with more filters, #523 must be merged somehow. What should be done to accomplish this as fast as possible?
This question probably cannot be answered without any experiment, but I always have been thinking that 8-moves history for the ko detection is too much, though more feature planes will lead to a stronger AI somehow practically. Can we consider reducing the input dimension from current 8x2+2 to a smaller one, like 4x2+2?

Dorus commented 6 years ago

A milder approach would be to accept-if-not-too-bad one, and to prioritize networks with more training steps. Is it reasonable enough?

But the goal of AZ is to eliminate evaluation matches. If you need to know "not-too-bad", you need evaluation matches and you could as well go full AGZ. (This is pretty much the opposite argument of what we used to reject switching to the AZ method this run)

Also, shouldn't we try to reproduce AZ exactly because "we're not sure if it is reproducible"? If we change all kind of things and then fail (or succeed), we still do not know if it is because we changed a bunch of stuff or because it's an inherently bad method.

Anyway, before we start with a new larger network. How viable would it be to do one or a few runs with a smaller network, but with some variables adapted. For example we could use the current games and train a 3x32 network, and then run 500k games. After that run a 3x32 network from scratch and run 1m games to see the result. (And would these results carry over to larger networks?)

Another experiment i like to see would be to try different window sizes. We could use the current 5x64 network for that. Just go back 1m games, and train the then best network with a 100k or 500k window (or possibly 2 runs with each window size), and then run 500k games or so.

We're at 43k games/day now, so experiments like that would take ~2 weeks, but might give valuable data on our next run that might take several months. using a 3x32 network could probably 4 fold our game output and only take a couple days to get meaningful result.

RavnaBergsndot commented 6 years ago

AGZ uses rectifier nonlinearity, while we are currently using ReLU if I am not terribly wrong. For the new network, it could be desirable to change the activation from ReLU to a nonlinear one, but unfortunately the AGZ paper lacks details about this. What would be our choice? There are many options like LeakyReLU, CReLU, or ELU.

BN dramatically reduces, if not totally eliminates the assumed advantages of other fancy activations over ReLU. This is why all those huge CNNs(Res-101, 1201, etc) prefer trying all kinds of different structures and filter/layer combinations rather than exploiting the seemingly low-hanging fruits of better activation functions. They are not low-hanging fruits because they only offer advantages in some non-general cases and controlled environments.

By the way ReLU is a nonlinear function, and two layers of ReLU could theoretically approximate all continuous functions, just like tanh and sigmoid.

This question probably cannot be answered without any experiment, but I always have been thinking that 8-moves history for the ko detection is too much, though more feature planes will lead to a stronger AI somehow practically. Can we consider reducing the input dimension from current 8x2+2 to a smaller one, like 4x2+2?

4x2+2 can't detect triple-ko.

evanroberts85 commented 6 years ago

I started [the 128x6 network] with a huge window and have been narrowing it.

You could repeat the same process but with a 64x5 network like the current one, to see how much of the gain (if there is a gain) comes from the increase in network size and what effect just changing the training had.

isty2e commented 6 years ago

@Dorus It is true that there is no evaluation in AZ, but I am not sure if it is the purpose. In fact, the motivation for changes from AGZ to AZ is unclear in the paper.

@RavnaBergsndot That is theoretically true to an extent, but in practice it affects the performance more or less, usually depending on the nature of the dataset. A simple example would be this. After all, this project aims to be a faithful replication of AGZ, so why not? Also AFAIK a triple ko consist of 6 moves so 3x2+2 will do, and we are not adopting superko rules, so is it any meaningful to detect a triple ko after all?

barrtgt commented 6 years ago

A consistent training procedure with no added variables would be nice to compare different configurations. I think the 5x64 has quite a bit more potential, but was ham-stringed by a rough start. I like the idea of the AZ method of using the latest network. I vote to do a small scale AZ approach first.

RavnaBergsndot commented 6 years ago

That is theoretically true to an extent, but in practice it affects the performance more or less, usually depending on the nature of the dataset. A simple example would be this. After all, this project aims to be a faithful replication of AGZ, so why not? Also AFAIK a triple ko consist of 6 moves so 3x2+2 will do, and we are not adopting superko rules, so is it any meaningful to detect a triple ko after all?

Most of the experiments in that paper were done without BN. BN enforces most input points falling into the most interesting part of the ReLU domain, therefore reduces the need of non-zero value outputs when the input is negative. We need more recent experiments.

I'm also not convinced that AGZ's "rectifier nonlinearity" means "rectifier plus nonlinearity on its negative domain" instead of just ReLU itself.

3x2+2 won't do, because that "x2" part is for the same turn. "These planes are concatenated together to give input features st = [Xt, Yt, Xt−1, Yt−1, ..., Xt−7, Yt−7, C]." Therefore for 6 moves, we need at least 6x2+2.

ddyer0 commented 6 years ago

I feel the need for a more general discussion, more beginner friendly and less specifically about leela. Please have a look at https://www.game-ai-forum.org/viewforum.php?f=21

isty2e commented 6 years ago

@RavnaBergsndot Well but the batch normalization was there in CIFAR-100 benchmark. And if the shift variable is somehow set inappropriately during training, the batchnorm layer can shift the input to the negative region of ReLU ("dying ReLU"), so that is the idea behind all the modified rectifier units. I can hardly imagine any case referring to ReLU by the "rectifier nonlinearity", because you know, ReLU is a rectifier linear unit.

And you are right about the input features, though I still do not see why we need to detect triple ko at the first place.

grolich commented 6 years ago

though I still do not see why we need to detect triple ko at the first place.

Triple kos matter in rule systems without superkos.

Actually, they change the result of the game. With superko, they would just make a move illegal.

Without it, they form a (really interesting in a game, I might add) situation where, if neither player is willing to give way, the game cannot end and is declared a draw (actually a "no result", but in situations where no return-matches are played, it's effectively the same).

Not including enough information for triple-ko detection in the NN would make the network unable to tell the difference between a situation where a move would end the game without a win or a loss, and one that will.

So even if we aren't interested in superko, it's still a bare minimum to be able to detect triple ko.

That being said, It might help a lot in superko detection as well, since gapped repetitions are exceedingly rare in actual play, perhaps sufficiently so that the "damage" of not recognizing these cases without search might not be felt.

However, the important thing was to demonstrate why triple ko detection is needed even if we do not use superko.

ayssia commented 6 years ago

Why 128 filters? 24 blocks 64 filters should be consuming same time as 6 128, I wonder how blocks/filters affect strength... Maybe we can train a 24 64 network and a 6 128 network, to compare with them?

jkiliani commented 6 years ago

64 filters, 24 blocks will almost certainly use more time than 128 filters, 6 blocks. @gcp explained earlier that increasing the number of filters allows more parallelization and is thus usually much less than quadratic in computation time on a GPU. Layers have to be evaluated serially on the other hand.

gcp commented 6 years ago

I'm also not convinced that AGZ's "rectifier nonlinearity" means "rectifier plus nonlinearity on its negative domain" instead of just ReLU itself...I can hardly imagine any case referring to ReLU by the "rectifier nonlinearity", because you know, ReLU is a rectifier linear unit.

I'm 99.9% sure that "rectifier nonlinearity" exactly means ReLU. ReLU is a non-linear unit constructed from a rectifier and a linear unit. A rectified linear unit is a rectifier non-linearity.

As was already pointed out, the advantages of "more advanced" activation units disappear when there are BN layers involved, which is why everyone including DeepMind just uses BN+ReLU.

gcp commented 6 years ago

2) In addition to the training window by number of games, how about filtering out games based on rating also (like 300 or any reasonable value)?

It's important to make sure the window has enough data or you will get catastrophic over-fitting, especially for the value heads. You can test this yourself. This can't be guaranteed if you introduce a rating cutoff so it's a bad idea.

gcp commented 6 years ago

You could repeat the same process but with a 64x5 network like the current one, to see how much of the gain (if there is a gain) comes from the increase in network size and what effect just changing the training had.

Be my guest and be sure to let us know the result.

isty2e commented 6 years ago

While it is true that BN mitigates the dying ReLU problem a lot (and especially considering we are using ResNet) and therefore BN+ReLU practically works very well, it is not completely true that the architecture is completely free from the problem. Of course, if there is no problem with the current LZ nets, the change is unlikely to be made anyways.

For the training window, overfitting can be of course potentially problematic, but the counter-argument that it can be bad to learn from bad policies and results of weaker games also makes sense, and we don't really know if overfitting is severe after all, so both are not well supported by data, I would say. So what kind of experiment is good enough here? Training with a smaller window is fine, but doing that for several generations with self-plays from that trained network is nearly impossible for an individual. So if we restrict the experiment to be done for a single generation, how can we measure the strength? Self-play ratings are not necessarily applicable to non-LZ players. Is the match between networks trained with narrower and wider windows meaningful enough?

ashinpan commented 6 years ago

@gcp "Once 64x5 has completely stalled, I will see if I can get the 128x6 to beat it."

IMHO 128x6 doesn't even need to beat 64x5. Suppose you find out that 128x6 is about 300, 500, 700, or even 1000 elos lower than 64x5. It still means the former can play reasonably. Then, we can just adopt it and improve it by training on its own self-play games. It would still be much better than starting from the scratch.

ashinpan commented 6 years ago

@gcp When you decide that we should move to 128x6, you can pitch it against at least 3 best networks (latest but about 1000 elos apart). Then we can decide the exact elo of the initial 128x6, which should be our starting point.

Dorus commented 6 years ago

Why put that burden on gcp? Dont be lazy and just run it yourself @ashinpan :)

Or just wait for one of our other enthusiasts to do so, i'm 100% sure somebody will.

gcp commented 6 years ago

IMHO 128x6 doesn't even need to beat 64x5.

If it can't from a similar training set, then what's the point of moving to 128x6 - with the same training set?

ashinpan commented 6 years ago

@Dorus We haven't reached the complete stall yet, and it is @gcp who must decide that we actually have. Besides, he just can send out matches to do such a test; he doesn't need to do anything.

ashinpan commented 6 years ago

@gcp Have you read my comment to the end?

jkiliani commented 6 years ago

If 128x6 trained by supervised learning can't beat 64x5 trained to saturation by reinforcement learning, that mainly implies that the supervised learning can't absorb all the knowledge from games it didn't play itself. It certainly doesn't mean that such a net wouldn't beat 64x5 in short order once trained by reinforcement learning itself.

ashinpan commented 6 years ago

@jkiliani I agree with you.

fffasttime commented 6 years ago

I found leelaz has a tendency to forget learnd knowledge. Though current weight 65e94e52 much stronger than before on midgame, earlier 40b94cfe seems playing better on endgame. Since 58da6176 beat 40b94cfe by midgame, the endgame playing of leelaz improved so slowly. If we train a network based on previous network like AlphaZero, could it work? Or any better way to solve it?

jkiliani commented 6 years ago

@fffasttime I think if your observation is correct, it simply means there is still considerable improvement potential in 64x5. What will likely happen is that eventually the learning process won't produce stronger networks anymore at 0.001 learning rate, but that with reduced rate, the networks will reach slightly higher mid game strength than now combined with higher endgame strength than 40b94cfe. We'll see what happens, but this run definitely doesn't seem to be quite over yet.

isty2e commented 6 years ago

The reason behind self-forgetting is highly likely due to network capability or learning rate. As suggested in the OP, we can try lowering the learning rate, and if it still stalls we might safely conclude that this is close to the limit of the current architecture.

ashinpan commented 6 years ago

@fffasttime It is well-known that Alphago also makes quirky endgame moves, probably owing to the same cause here. Perhaps this is the motivation for Alphazero adopting the method of training on the last network.

jkiliani commented 6 years ago

Maybe we should reconsider the resigning... it's possible we'd get better results letting all games play to the end, maybe with just 400 playouts after one of the players falls below the resignation threshold.

In either case, reinforcement learning so far appears to be remarkably robust in that it fixes its own weaknesses even with the presence of bad data. I doubt the problem will persist.

evanroberts85 commented 6 years ago

A new network could understand that one move is bad, but does not yet know that an alternative is even worse, because it has not been played much before in training games. This reminds me of the phrase "A little knowledge is a dangerous thing". In this case training with the new network despite its new weakness may be beneficial.

That said training with a network which scores only <30% will mean that then following network will need to score 70+% just to get back to where we were, assuming Elo ratings work cumulatively (which they do not). I can not see this leading to faster overall progress given how badly the average network does.

I am optimistic about the 128x6 network myself. I will try to get training running on my laptop tonight to do some tests but without a good GPU I am not sure if I can get reasonable results fast enough.

ashinpan commented 6 years ago

@gcp (Sorry, I just noticed your edited comment) "If it can't from a similar training set, then what's the point of moving to 128x6 - with the same training set?" I never said that it should be the same training set. We can start training on the new self-playing games of 128x6. (If I am not wrong, this was how AG Master was born)

gcp commented 6 years ago

I never said that it should be the same training set.

But I am saying that it should. If training 128x6 on the training data that is beyond saturating 64x5 does not produce an improvement, then this implies that data is sub-optimal for that network. And we should reset rather than use data we known is not optimal (and get risk getting stuck in a lower optimum).

If 128x6 trained by supervised learning can't beat 64x5 trained to saturation by reinforcement learning, that mainly implies that the supervised learning can't absorb all the knowledge from games it didn't play itself. It certainly doesn't mean that such a net wouldn't beat 64x5 in short order once trained by reinforcement learning itself.

Exactly. My point is that if 128x6 cannot (somehow) use the training data from the 64x5 well enough, we should get new data, and not try recover a half-crippled net.

gcp commented 6 years ago

A 4th point for a new run would be to extend the training data format to include the resign analysis. Shouldn't forget about that either.

gcp commented 6 years ago

If I am not wrong, this was how AG Master was born

...and we get back to the open question, that if AG Master with 256x20 was better than AGZ with 256x20 (this can be inferred from the graphs in the papers), why did DeepMind do AGZ 256x40, and not AG Master 256x40.

leela-zero / leela-zero

Version 0.10 released - Next steps #591