We cannot judge whether a network is pure computer self-play.
Those networks generally weren't enough stronger to make up for the speed drop. If they had been, we would have just switched to them. But with ELF there was no question. We didn't switch to it because we don't have the code to train it (among other reasons...)
I think one of the reasons we don't switch is that we have trained for such a long time, and just using ELF or another network might waste those efforts, which produced LZ's own style of playing Go. A 50/50 mix is a good way to learn strong moves without losing LZ's own style or strengths (presumably).
@gcp But if you look at the charts from @Friday9i in #1113, you can come up with a lower number of visits for these networks that would reduce the speed drop. (These graphs also show that if we are around 1600 playouts now, the ELF network needs only 315 playouts to be as strong as L131, so you could reduce the visit number to 1800 and still have a better network that delivers games faster.)
Yeah, it's true that we could run ELF with lower visits. We just added it quickly with a one-liner on the server side. I didn't think the case was as clear for the 256x20s? But I did not test them all.
It's pretty simple to have the server send ELF self-play tasks to be run with a different number of visits. Although keeping it at the usual 3200 means we should be improving training quite a bit more than plain ELF.
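For illustration, here is a hypothetical sketch of that server-side choice in Python. The field names ("cmd", "hash", "options"), the 50/50 split, and the visits parameter are my assumptions for illustration; this is not the actual leela-zero-server code.

import random

def pick_selfplay_task(best_hash: str, elf_hash: str,
                       elf_visits: int = 3200) -> dict:
    # Hypothetical sketch: send half the self-play tasks with the ELF
    # weights, optionally at a lower visit count than the usual 3200.
    if random.random() < 0.5:
        return {"cmd": "selfplay", "hash": elf_hash,
                "options": {"visits": elf_visits}}
    return {"cmd": "selfplay", "hash": best_hash,
            "options": {"visits": 3200}}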
I disagree; I think it would be better to have more games from a somewhat weaker network than few games from a really strong network. BTW, this also implies that if we ran, let's say, 20% of our self-play games with a higher visit count, we would also get stronger games out of our network. Maybe (if we finally use the autogtp statistics patch) we could give these stronger games to the faster clients.
I was just referring to the quality of training data. Increasing the quantity of training data from a much stronger network could be beneficial too. Better? Maybe?
For the statistics patch, I seem to recall that the approach there was to have a unique identifier for each client's device. The server can estimate how fast an IP address can complete tasks (different from the current throughput measurement) without persisting tracking IDs.
This links up well with the fact that AZ used 'only' 800 visits (playouts?) for self-play, but in conjunction with a 20b net. It seems that using a large net from the start is a penalty for bootstrapping, but once the net is knowledgeable it scales faster than a smaller one in terms of training data quality and as an "improvement operator". So moving quickly to a 20b+ net for self-play makes sense. On the other hand, the current skyrocketing LZ strength curve does not call for any rushed change. Contributors might be happy with that trend for a while ;-)
Contributors might be happy with that trend for a while ;-)
I'm sure everyone would be happy with this, but I cannot imagine LZ on 192x15 catching up to ELF, which was presumably at its reinforcement learning limit on 224x20. Even if the temperature change turns out to give an actual performance ceiling boost for even games, I'd be surprised to actually overtake ELF without bootstrapping to 256x20 first.
Storing client IP addresses raises legal issues (at least in the EU, with the upcoming data privacy regulation), but the client's IP could be one-way hashed and the hash stored with the game, for any purpose (sending weights according to client stats, detecting/tracing broken clients, bad games, etc.).
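A minimal sketch of that one-way hashing idea (my illustration, not the project's code). Note that a bare hash is not enough: the IPv4 space is small enough to enumerate, so a secret, keyed salt is needed, and rotating it limits long-term tracking. The IP_HASH_SALT environment variable is an assumed deployment secret.

import hashlib
import hmac
import os

# Assumed deployment secret; rotating it invalidates old identifiers.
SALT = os.environ.get("IP_HASH_SALT", "dev-only-salt").encode()

def client_id(ip: str) -> str:
    # Keyed one-way hash: stable pseudonymous identifier for a client IP.
    # Without the secret key, a plain SHA-256 could be reversed simply by
    # hashing all ~4 billion IPv4 addresses.
    return hmac.new(SALT, ip.encode(), hashlib.sha256).hexdigest()[:16]

# Example: store client_id("203.0.113.7") with each game instead of the IP.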
We currently store the client's IP in the database to be able to recover in case of a spamming/flooding attack, which is a use that GDPR allows and doesn't need prior consent. But yes, you can't rely on the original data being present there longer term. Throttling is short term though so it would work.
In my patch, the stats are computed on the client and sent every time the client requests a new task. That way the server does not need to store IP addresses; it just uses the stats the client provides to choose which task to deliver.
Those networks generally weren't enough stronger to make up for the speed drop.
Not being a stronger player at time parity does not necessarily mean not being a better teacher at time parity! The current promotion wave shows how essential selfplay labeling quality is.
(And AG's search with its low visit counts may even represent a somewhat better labeling system than LZ search with the same visit counts, who knows.)
Not being a stronger player at time parity does not necessarily mean not being a better teacher at time parity! The current promotion wave shows how essential selfplay labeling quality is.
This sounds nice but it does not make for a logically sound argument. ELF is much stronger even at time parity.
Not being a stronger player at time parity does not necessarily mean not being a better teacher at time parity! The current promotion wave shows how essential selfplay labeling quality is.
The current promotion wave is due to ELF games and t=1 in unknown proportions. Attributing it exclusively to the ELF self-play, and deducing from there that increasing visits is a good idea, seems like a weak logical conclusion to me, in particular since the change is still so recent.
@gcp Are we aiming to get a particular number of ELF games (e.g. half of a training window) and stop producing more at that point, and just mix the existing ELF games with a shifting window of LZ games?
It seems imprudent to allow ELF games to grow to more than half the window. I'm hoping LZ is pretty close by then. If not, I'll probably limit ELF to the latest 250k (when dumping) but allow new ones to come in to provide more data.
There's a good possibility we can't get to ELF without another size increase though.
Do you think we can actually get pretty close to ELF on 192x15? I'd find that surprising, since presumably ELF is at or very near the skill limit of 224x20...
See the edit: no, we might stall before that. I hope @bjiyxo can keep updating his effort.
Training on more diverse (ELF+LZ) games could also be an advantage in itself, as is training on games played by a different net than the one being taught. Of course both mostly affect learning speed, but they may even push the limits a bit as well (pure self-play may not completely reach the theoretical peak of a net structure, especially with weak search).
Since ELF helps LZ so much, does it mean that the 15x192 capacity is very far from being exhausted and that the real bottleneck is self-play?
@Marcin1960 I thought the same. It seems like ELF at a small 20b is surpassing others at 40b and 60b (PhoenixGo, FineArt, Golaxy)... So I hope there is much room left in the 15b!
Agreed that we will eventually need to go to 20b, but I'm not sure how prudent it is to use a 20b from a third-party individual who originally net2net'd it from a 6-block. Maybe it's time to do a brand-new net2net to 20b from scratch in the post-ELF era?
If ELF could reach this level at 20b, I think it just means that they figured out how to correctly reproduce the Alphago Zero approach, and that apparently, FineArt and PhoenixGo didn't. Just using 40 blocks doesn't give you a strong program, if there's a problem e.g. with the UCT search, training parameters, or something else.
This just means that FineArt still is very far from Alphago Zero 40 blocks... there's a whole lot of improvement potential left in computer Go.
@hydrogenpi "Maybe its time to do a brand new net2net to 20 b from scratch in the post-ELF era?"
You seem to be very eager to deviate from the present course of LZ project in various ways, as soon as possible.
@Marcin1960 Where is the deviation? If the idea is to go to 20 blocks, and to use net2net to do so, what issue do you have with net2net'ing the ecab 15-block to 20 blocks and going from there, rather than using a 20b that was originally net2net'd from a (now ancient) 6-block? Where is the logic in that? I don't see how suggesting we net2net from the newer architecture rather than from a much older one could conceivably be perceived as a "deviation".
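For readers unfamiliar with net2net: the idea (from the Net2Net paper) is to grow a trained network while preserving its function, e.g. deepening it with blocks initialized to the identity. Below is a minimal conceptual sketch in Python/NumPy, ignoring leela-zero's batch-norm details; the shapes and function name are illustrative, not the actual procedure used for these nets.

import numpy as np

def identity_residual_block(channels: int = 192, kernel: int = 3):
    # Net2Net-style deepening: a new residual block whose output equals its
    # input, so the deepened net initially plays identically to the old one.
    # Zero-initializing the second convolution makes the residual branch
    # contribute nothing (x + 0 = x); training then grows the new capacity.
    # In practice the same effect can be had by zeroing the final BN scale.
    conv1 = np.random.randn(channels, channels, kernel, kernel) * 1e-2
    conv2 = np.zeros((channels, channels, kernel, kernel))
    return conv1, conv2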
I'm not sure whether I should keep going with 20x256. One reason is that GCP may not use my 20x256 to replace 15x192. Another reason is that there are some issues (e.g. the ladder issue in Golaxy game 6) and we don't know if they can be fixed by self-play. And last but not least, 20x256 may not be much stronger than ELF's weights, so training another 20x256 might be useless. So I'm a little bit lost now...
I think the ladder issues are probably fixable by self-play at least.
One reason is that GCP may not use my 20x256 to replace 15x192.
I am certainly planning to do that when a) 192x15 runs out (but we don't know when that will be), and b) it's not a lot worse than 192x15 at that point (the latter has never happened with your weights, but it happened with some of mine!)
Another reason is that there are some issues (e.g. the ladder issue in Golaxy game 6) and we don't know if they can be fixed by self-play.
I wouldn't read too much into that as we have seen for example that ELF can also exhibit them.
And last but not least, 20x256 may not be much stronger than ELF's weights, so training another 20x256 might be useless.
That might be true if ELF indeed reached the limit of 224x20, though one could hope we can train a 256x20 that is better in handicap games.
@bjiyxo Your 20b experiment still proved valuable even though it wasn't directly used: in my opinion it helped persuade the official project to move up in size sooner than it otherwise would have, which was a good move since, in retrospect, the 10b was also saturated. Many people say "time doesn't matter" as if there were all the time in the world, but in the real world that isn't the case; it's always a factor. I'm not saying we are in a Go AI arms race, but time of course matters in everything in real life.
Would it be worthwhile to experiment with a 40b, or even a mid-size net between 20b and 40b?
256x20 may not become much stronger than the ELF weights, but it should become considerably stronger than the LZ 192x15 weights. I do not think the effort to train it was wasted at all; on the contrary, @bjiyxo has contributed very significantly to this project regarding methods to bootstrap larger networks from self-play games.
About the current 256x20 net, wouldn't it be much more promising to get this network trained as high as possible with the ELF and t=1 self-play games so it can take over from 192x15 when that architecture's ceiling is reached, rather than training another net from scratch? Going higher on residual blocks could still be done later. Also, a Leela Zero 256x20 net at capacity may only be somewhat stronger than ELF, but likely much stronger at giving handicap due to the temperature change.
I'm now training 40x256 instead of 20x256, because ELF may have reached the limit of 20 blocks. I'm still hesitating about whether I should keep training 20x256. In fact, LZ is growing rapidly now and maybe I should run autogtp instead of training 20x256.
Maybe @gcp would be willing to train up your network from the last checkpoint then? It would seem a shame to have this network go to waste after all the effort you put into it...
In fact, LZ is growing rapidly now and maybe I should run autogtp instead of training 20x256.
There are 812 people running autogtp, there are 0 people training up a >15x192 network (that I know of).
As to whether 256x20 or 256x40 is best right now, I do not know.
@bjiyxo " LZ is growing rapidly now and maybe I should run autogtp instead of training 20x256."
Definitely! 15x192 should be a priority until its potential is REALLY exhausted.
BTW, my selfish reason is that 20x256 is too slow on my hardware. I am not going to buy a new PC in the near future, and if this becomes a requirement I will have to drop out.
@gcp What's the current proportion of self-play games generated with the ELF weights vs. with the regular 15x192 LZ?
Should be very close to 50%. Maybe a bit lower for ELF if people didn't update the client.
@Marcin1960 Just so you know, I'm also not blessed with powerful hardware here, but I can still run LZ on autogtp even if it takes forever, and Lizzie even if I only get a couple hundred visits per move instead of the thousands that GTX 1080 users get. I'm all for continuing 192x15 until it reaches its limit; I'm just not into squeezing the last Elo out of it if that compute could more productively go into advancing the project further. Deciding on the best architecture is something that those who can train such nets should settle among themselves, but upgrading to something larger once you stall looks like a no-brainer from my perspective.
An interesting thing @Mardak found when looking at ladders is that it seems ELF just avoids ladders totally (very low priors), whereas LZ reads the ladders out to the end (very high priors). This also means LZ can play a winning ladder if there's a ladder breaker, but ELF won't.
Then I will keep training both 20x256 and 40x256. So there will be another new 20x256 in a few days.
I would like to stay on 192x15 as much as possible, for the same reasons as @Marcin1960. Seeing the sharp rise we have right now, I would like to do an experiment once we have been on a bigger network for some time: try to train our good old 192x15 with all the games from the bigger/better network to see if we can squeeze some more Elo from it. Actually we could even try that right now on a smaller scale with the 128x10, 128x8, and even 64x5 networks, just to test this crazy idea. But I don't have the horsepower and the know-how to do so by myself :(
@Cabu "Actually we could even try that right now on a smaller scale with the 128x10, 128x8, and even 64x5 networks just to test this crazy idea. But I don't have the horsepower and to know how to do so by myself :(
Bingo!
I would train a few 128x10 nets. The result could clarify many questions or might SURPRISE us!
It would be a very interesting experiment.
Training a smaller network is faster, so why don't we train two 128x6 nets as a test: one using the self-play games of the 15b, and one using the self-play games of ELF, and see what we can find out.
Maybe this way we can make some nets smaller but stronger.
I looked through a couple of new ELF self-play games played by LZ 0.15, and excluding 1-visit moves seems to have made a huge difference. All of the games I saw now look reasonable, and the majority aren't resigned at move 92 anymore either, which seems really promising as well. Maybe we should phase the 0.14 ELF games out of the window eventually once there are enough 0.15 games to fill half the training window?
So I did the experiment: the same network (L135) against itself, but one side got twice the visit count. These were the command lines:
./leelaz -g -v 6401 --noponder -t 1 -q -d -r 0 -w net_doublevisit
vs
./leelaz -g -v 3201 --noponder -t 1 -q -d -r 0 -w net_normalvisit
and this was the result:
60 wins, 26 losses
The first net is better than the second
net_doub v net_norm (86 games)
             wins         black         white
net_doub   60 69.77%    32 66.67%    28 73.68%
net_norm   26 30.23%    16 33.33%    10 26.32%
                        48 55.81%    38 44.19%
and the calculated Elo difference is 63.
After this result, which confirms that the same network plays better self-play games with a higher visit count, we need to evaluate whether it is worth increasing the visit count (just by a small percentage) of the self-play on fast clients, or whether it is not worth the pain.
60 - 26 is an Elo difference of 145 according to http://www.3dkingdoms.com/chess/elo.htm, although a result like this would still have a very significant error bar attached to it. Even so, I have significant doubts that it would be worth giving up a factor of 2 in game generation speed, especially when it's already so low. The difference of 1 visit to 3200 visits seems to be in the range of ~1500 Elo (?), and that is the reference strength gap to compare against, since we fit the raw net to approximate the visit distribution of the search output.
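For reference, the conversion behind that calculator is the standard logistic Elo model; a minimal sketch (the same formula reproduces the +159 and +181 figures quoted later in the thread):

import math

def elo_diff(wins: int, losses: int) -> float:
    # Elo difference implied by a win rate under the logistic model:
    # expected score E = 1 / (1 + 10^(-d/400)), solved for d.
    p = wins / (wins + losses)
    return 400 * math.log10(p / (1 - p))

print(round(elo_diff(60, 26)))  # 145, the figure cited above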
Didn't @Ttl estimate the optimal visit count for maximum strength gain per unit of time from the shape of the visit distribution over total visits a while ago? I remember that this estimate at least suggested that fewer visits are more efficient...
Yes it was 500 visits, see @Ttl's posts here https://github.com/gcp/leela-zero/issues/1348#issuecomment-386865798 https://github.com/gcp/leela-zero/issues/1030#issuecomment-374246946
Thanks for running that. I believe the previous estimate for doubling visits on 128x10 (?) was about a 200 Elo difference. The primary purpose of self-play with some number of visits for search is that it's a policy improvement operator, so yes, increasing visits would definitely help generate stronger training data, at the cost of fewer games / less training data.
But I would think part of the reason for including ELF-generated self-play is that it's just more efficient at producing higher-quality self-play. Using @Friday9i's results in https://github.com/gcp/leela-zero/issues/1113#issuecomment-387311283, the orange dot shows ELF with ~320 visits is as strong as 192x15 with ~5x more visits (1600 visits). The orange line never goes below 2, so if we estimate the 224x20 slowdown compared to 192x15 to be 2x, this means that even accounting for size/slowdown, ELF will generate higher-quality self-play than just searching with more visits on 192x15.
Edit: Yes, as a long-term plan when ELF isn't as useful as a teacher, increasing visits could help, although one would probably need to rerun the analysis of 6400 vs 3200 visits at that point.
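To make that compute argument explicit, a back-of-the-envelope restatement (the numbers are the thread's estimates, not measurements; in particular the 2x slowdown is an assumption):

elf_visits = 320       # ELF visits matching 192x15 strength (orange dot)
lz_visits = 1600       # 192x15 visits of equal strength
elf_slowdown = 2.0     # assumed 224x20 vs 192x15 cost per visit

lz_cost = lz_visits                   # cost in 192x15 visit-equivalents
elf_cost = elf_visits * elf_slowdown  # 640 visit-equivalents
print(lz_cost / elf_cost)             # 2.5: equal-quality ELF data is ~2.5x cheaper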
I know that ELF is much better, but I was exploring the possibility of getting stronger self-play games even after we reach ELF's strength. I also wanted to see whether the more-visits theory could be statistically confirmed.
Just doing a quick test of doubling low visits (100 vs 50) with 192x15 LZ136 and 224x20 ELF:
55 wins, 22 losses
The first net is better than the second
double v 192x15 (77 games)
             wins         black         white
double     55 71.43%    23 76.67%    32 68.09%
192x15     22 28.57%     7 23.33%    15 31.91%
                        30 38.96%    47 61.04%
51 wins, 18 losses
The first net is better than the second
double v 224x20 (69 games)
             wins         black         white
double     51 73.91%    26 72.22%    25 75.76%
224x20     18 26.09%    10 27.78%     8 24.24%
                        36 52.17%    33 47.83%
Those results are +159 Elo and +181 Elo respectively with ±12% margin of error.
But a stop-gap alternative to switching away from ELF when LZ reaches ELF's level is to generate ELF self-play with doubled visits too -- similar to marcocalignano's proposal.
Since the use of the ELF network seems to yield such good results, why don't we use other networks too? For example, we could keep training the 20b network from @bjiyxo and, when it is better than the current one, also use it for self-play games to push the actual network. Maybe someone could start training a 40b network.