leela-zero / leela-zero

Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper.
GNU General Public License v3.0

How to find Information about the new best network #78

Open sethtroisi opened 6 years ago

sethtroisi commented 6 years ago

I noticed a new network (fe3f6...) in http://zero.sjeng.org/networks/

Can you provide some information about win rate over the previous best network? Number of games it was trained on? How long it took to train?

I'm thirsty for details :)

gcp commented 6 years ago
leelaz-9k v leelaz-19k (176/2000 games)
board size: 19   komi: 7.5
             wins              black         white       avg cpu
leelaz-9k      65 36.93%       37 42.53%     28 31.46%   2150.07
leelaz-19k    111 63.07%       61 68.54%     50 57.47%   2263.43
                               98 55.68%     78 44.32%

19k games. Learning on 38k is running now.

I observed that the network now thinks white has an advantage in the opening. I think this is because it learned that if black passes before capturing much or gaining territory (not something it understands at this point), white will win on komi.

sethtroisi commented 6 years ago

Thanks for the information.

I see that the network file is named based on the number of games it was trained on, which will help answer this question in the future.

gcp commented 6 years ago

The files on the server are named after the hash of the contents, though. I just do this to keep track of which is which. I also tested a smaller network (to control for overfitting) but it was not better.

sethtroisi commented 6 years ago

I was referring to the "19k.txt" filename inside of fe3f6...gz

lithander commented 6 years ago

I appreciate updates like that. I think it will help keep contributors motivated to start up autogtp.exe when they get some feedback on how their contributions help make progress.

Btw, how good would the current network (19k) play against a human?

HaochenLiu commented 6 years ago

Including the win rate info in the best network would be great.

@lithander I don't think the 19k network is better than a human beginner.

gcp commented 6 years ago

It barely knows how to count I think.

olbrichj commented 6 years ago

I was quite curious and replayed a few of the games. Interestingly it seems like the newest version has a small understanding of specific shapes.

gcp commented 6 years ago

The learning now has the policy network achieving a 4% prediction rate, which is very far from random (0.3%). I wonder if this is just learning to understand what the legal moves are (many training games have almost-filled boards) or if it can already statically see some capture and defense moves.

jkiliani commented 6 years ago

The learning now has the policy network achieving a 4% prediction rate, which is very far from random (0.3%).

Could you please clarify what prediction rate means? Is this in regards to a dataset of human professional games (GoKifu), as they used in Fig.3 of https://www.nature.com/articles/nature24270?

gcp commented 6 years ago

It's a prediction rate over the dataset of games from the program itself. So in 4% of the cases, it correctly guesses the move it would eventually play after searching. This is a sign play is starting to become more non-random. (Or, as said above, maybe simply that the network now understands you can't place stones on top of each other)
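
For concreteness, here is a minimal sketch of how such a top-1 prediction rate could be computed; the array names are placeholders, not anything from the actual training code.

```python
# Hypothetical sketch of the top-1 "prediction rate" over self-play data.
# policy_probs: (N, 362) array of network move probabilities (361 points + pass).
# search_moves: length-N array with the index of the move actually played after search.
import numpy as np

def prediction_rate(policy_probs, search_moves):
    predicted = np.argmax(policy_probs, axis=1)  # the network's top choice per position
    return float(np.mean(predicted == search_moves))

# A uniformly random policy over ~361 points would score roughly 1/361 ≈ 0.3%.
```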

pcengine commented 6 years ago

Hi, you mentioned that learning on 38k is running now, and I wonder how much time will this 38k learning process take? Could you give an estimation based on your experience? Thank you.

roy7 commented 6 years ago

Any idea what the prediction rate would be for a strong fully trained Leela Zero? Did you ever happen to try loading in Leela's own human game data into Leela Zero's architecture just to see what happens?

gcp commented 6 years ago

That's how the supervised network in the README.md was built. It does about 52.5% on humans. But the prediction rates for those vary a bit with the exact dataset. Also it trained for only a few days. You can probably get quite a bit more by running it for a few weeks.

The prediction rate for a Zero that is trained without supervised learning (i.e. what we're building now) should be lower, because it won't learn to imitate the bad moves those puny humans play.

Marcin1960 commented 6 years ago

"The 19K game network beats the 9k game network 63% of the time. A 38K network is training now."

I wonder: should the older, less informed games be discarded at certain points?

gcp commented 6 years ago

AlphaGo Zero used a window of 500k games IIRC.

Marcin1960 commented 6 years ago

1/10th for Leela Zero? 50K?

featurecat commented 6 years ago

Are you sure about using a 500k game window instead of a 100-300k window? Because our network is only 6 blocks, it should improve faster.

gcp commented 6 years ago

I'm not sure about anything. But it's important to keep a window of the old games, or the network forgets the basic things it has learned before. This is a very common problem for reinforcement learning setups.
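
As a rough illustration only (not the actual training pipeline), such a window can be thought of as a bounded buffer that old games eventually drop out of, with training batches sampled from the whole buffer; the 500k figure is just the AlphaGo Zero value mentioned above.

```python
from collections import deque
import random

WINDOW_GAMES = 500_000          # window size is an open question; AlphaGo Zero reportedly used ~500k

window = deque(maxlen=WINDOW_GAMES)   # oldest games fall off automatically

def add_game(positions):
    """Append one finished self-play game (a list of training positions)."""
    window.append(positions)

def sample_batch(batch_size):
    """Sample positions uniformly across the whole window, so older games
    keep contributing and the network does not forget earlier lessons."""
    games = random.choices(window, k=batch_size)
    return [random.choice(g) for g in games]
```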

jkiliani commented 6 years ago

You could probably experiment with the window size, by testing two networks with the same recent data but different numbers of older games against each other. I doubt this would make any sense before we have at least 150-200k games though.

sethtroisi commented 6 years ago

I might suggest a slight wording change, given the confusion over kyu some people had:

"The 19K game network beats the 9k game network 63% of the time. A 38K network is training now." => The 19K game network beats the 9k game network 63% of the time. A new network is being training now on the first 38K games.

Matuiss2 commented 6 years ago

Replacing K with T to mark a thousand games wouldn't be a bad idea either, since K is a rank measurement in Baduk.

roy7 commented 6 years ago

Sorry. Although I play Go, saying 9K in that context didn't even occur to me it might confuse people. I just changed it to be full numbers with no abbreviation.

sbbdms commented 6 years ago

Is the network which learns from 38k games still under training? There are more than 61k games in the database now. I wonder if the next network which is used for AutoGTP will directly be the one which learns from 60k games or so.

Marcin1960 commented 6 years ago

To keep the progression I would see 76K as next :)

gcp commented 6 years ago
leelaz-19k v leelaz-38k2 (123/1000 games)
unknown results: 1 0.81%
board size: 19   komi: 7.5
              wins              black         white       avg cpu
leelaz-19k      64 52.03%       28 44.44%     36 60.00%   1441.24
leelaz-38k      58 47.15%       23 38.33%     35 55.56%   1527.08
                                51 41.46%     71 57.72%
leelaz-19k v leelaz-49k2 (47/2000 games)
board size: 19   komi: 7.5
              wins              black         white       avg cpu
leelaz-19k      25 53.19%       14 58.33%     11 47.83%   2581.61
leelaz-49k      22 46.81%       12 52.17%     10 41.67%   2577.97
                                26 55.32%     21 44.68%

The current ones did not beat the 19k games network yet. So clearly it's not all so easy. I am retraining 49k with stronger regularization, and starting 62k soon.

godmoves commented 6 years ago

@gcp Can you publish some games between different networks? I think it might be a way to motivate people who join the training of leelaz.

killerducky commented 6 years ago

@godmoves it hasn't been updated in a while, but you can get them here: https://sjeng.org/zero/

jkiliani commented 6 years ago

The strength evolution curve in the DeepMind paper is not strictly monotonic. That suggests to me that they must have allowed their network to update occasionally when the evaluator did not prove a strength increase, presumably to get out of local maxima.

gcp commented 6 years ago

Well, they did not allow the self-players to update; the whole story about the regular tests and the 55% margin is about that. But you are right: the non-monotonic curve suggests that they also saw what we are seeing now, namely that their testing regularly failed to find that network x + y was stronger than network x until y was quite a bit bigger.

gcp commented 6 years ago

Can you publish some games between different networks? I think it might be a way to motivate people who join the training of leelaz.

And the networks themselves are all here: http://zero.sjeng.org/networks

I'll update the dump of games in a few minutes. (Edit: Done)

Marcin1960 commented 6 years ago

@jkiliani

Yes, that is how biological evolution works; populations/species get stuck in suboptimal maxima:

https://conversionxl.com/wp-content/uploads/2015/09/locmax-1-568x338.png

BTW, would it be against the spirit of the project to take a shortcut and nudge this evolution by adding games of weak humans below 13 kyu? They would not pass on human tradition and assumptions, since they are naive and ignorant of them, but at least they know that starting near the edge is better than the center.

gcp commented 6 years ago

I have no idea what you are trying to accomplish by that, or even why you think it is already terminally stuck in a local maximum, or why you think you wouldn't force it to an even "worse" (in terms of optimization) plateau, or why you think the result would be interesting, or why you would want to take a "shortcut" to 13kyu instead of to 9 dan...

So no, not going to happen.

kityanhem commented 6 years ago

If LZ-62k can't beat LZ-19k with a 70% win rate, how many games do you think the new network will need before it beats LZ-19k?

kityanhem commented 6 years ago

I think you should build a tool that automatically trains a new network when the game count reaches a milestone like 10k, 20k, 60k, 100k, 140k games and so on (every +40k games, I think). The tool would then automatically have the new network play the old one for 100 games or more, up to you (50 games with the new network as black, 50 as white). If the win rate of the new network is 63-70% or more, the tool automatically pushes the new network to the clients; if not, it waits for the next milestone, trains a new network again and plays it against the old one (e.g. like now: 38k can't beat 19k, so wait for 49k; still no win, wait for 62k, etc.). You would just need to fix bugs and watch L-Zero grow up, without having to always focus on training the new network yourself. It's just my thought; a rough sketch of what I mean is below.
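
In rough pseudocode, the tool described above might look like this; every helper name here is hypothetical, and the thresholds are just the numbers from the comment.

```python
# Hypothetical sketch of the proposed automation; none of these helpers exist
# in the project, they only name the steps described above.
MILESTONE_STEP = 40_000    # train a new candidate roughly every 40k self-play games
PROMOTE_WINRATE = 0.63     # promote if the candidate wins 63-70% or more

def run_pipeline():
    best = load_current_best_network()
    next_milestone = MILESTONE_STEP
    while True:
        if count_selfplay_games() >= next_milestone:
            candidate = train_network(num_games=next_milestone)
            # Even match: half the games with the candidate as black, half as white.
            winrate = play_match(candidate, best, games=100)
            if winrate >= PROMOTE_WINRATE:
                publish_to_clients(candidate)   # autogtp clients pick up the new best net
                best = candidate
            next_milestone += MILESTONE_STEP
        wait_for_new_games()
```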

isty2e commented 6 years ago

I wonder how much statistical noise can be involved in the training and play-out evaluation processes. Is it something we must take into account?

gcp commented 6 years ago

@kityanhem The goal is to script the whole thing together so it runs continuously. In issue #1 there's some discussion (and a prototype) for tooling to terminate the evaluation match at the statistically earliest moment that is sound. But right now there are some manual steps. The procedure must be understood (and if necessary, debugged) well before automating it.

I have no idea about the game counts. I am a bit surprised that with 49k games vs 19k games it did not win easily like in epoch 1 and 2.

@isty2e I have no idea what you are asking. The play-out games are extremely noisy. Training is also rather noisy but in general will converge well due to the shape of the optimization. Testing matches are noisy but the theory for estimating the bounds is well known.
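
To illustrate the kind of bound involved (this is only a simplified sketch, not the prototype from issue #1), a match can be cut short as soon as a 95% confidence interval on the candidate's win rate lies entirely above or entirely below 50%:

```python
import math

def match_verdict(wins, losses, z=1.96):
    """Normal-approximation check on a running match score (draws ignored)."""
    n = wins + losses
    if n == 0:
        return "keep playing"
    p = wins / n
    half_width = z * math.sqrt(p * (1 - p) / n)   # ~95% confidence half-width
    if p - half_width > 0.5:
        return "candidate is stronger"
    if p + half_width < 0.5:
        return "candidate is weaker"
    return "keep playing"

# e.g. match_verdict(64, 58) -> "keep playing": 52% over 122 games proves nothing either way.
```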

Marcin1960 commented 6 years ago

If I may, I am a biologist interested in evolution. I am skeptical about the random-mutation-plus-selection scheme. It is much too simple, as anyone who has played with the DarwinBots simulation knows. To promote a couple of useful genes you need thousands of generations. It is too slow.

Most of the evolutionary work took a couple of billion years at the sub-cellular level (complex proteins and organelles), then the rest at the cellular level; only very late did multi-cell organisms appear. This preliminary process was faster by MANY orders of magnitude. Why? The size and multitude of organisms, multiplied by a fast exchange of generations. Multi-cell evolution by random mutations would be almost infinitely slower.

My opinion is that the random-mutation scheme took place before organelles were formed. Later it was rather more like smart object assembly.

That is why I suspect the makers of AlphaGo Zero "cheated" a little. To avoid contamination with human assumptions and tradition, it would be enough to take code like GnuGo and remove the human optimizations, which would result in the level of a not very smart beginner, perhaps around 15 kyu. Then I would generate 50 or 100k games and use them as a starting point. From that moment on it would be self-training without human input: starting at the level just before organelles were "invented".

Marcin1960 commented 6 years ago

I know it is a shortcut, but one can always return to point zero, especially once BOINC is implemented.

fsh commented 6 years ago

That is why I suspect that makers of AlphaGo Zero "cheated" a little.

According to the paper, the network plays essentially randomly at first but has learned to capture stones by the 3-hour mark. This was the 20-block version, and it generated a total of 4.9 million games over 72 hours. So 3 hours in, that's around 200k games. Apparently, then, by 200k games it had learned to capture stones, i.e. it has possibly learned some instance of counting.

Now, my question is, why do we need to recreate this phase exactly? Google was making a point. Seemingly we're not out to make a point.

I've been skeptical of adding more noise/input to this project/comment chain because GCP is probably getting swamped with all this by now, but here goes:

I think this problem is fairly unique to the very start/boot-up phase of the network, like now. The problem is that the value-evaluation output of a purely random network is just going to be a random number. So MCTS does next to nothing; MCTS doesn't actually teach it anything. The network absolutely needs to learn what a "point" is before it can go anywhere, before MCTS can add any intelligence or direction at all, and before training on self-played game + MCTS data has any meaning. Otherwise 1000 MCTS expansions or 1 million, it doesn't matter: I think in practice it will be the same as if we had just a single (1) expansion. It will in either case be the same as just picking a random move, and then teaching the next generation of the network to pick that random move if the ultimate result turned out to be a win (usually yes for white, usually no for black, but with the teeny tiniest possibility of variation when that move is (randomly) actually a capture or something that avoids a capture).

Now, maybe AZ was just lucky in that their initial random initialization somehow did already know of several "points" that could be had. Or-- maybe they cheated in their initialization. (Or they had some more clever/intelligent way of scoring the games in the very beginning rather than Tromp-Taylor.) They state in black and white that it was initialized with random weights, but...?

I'm not suggesting training on any human data at all. What I would suggest is to generate a ton (thousands, millions) of random 'final position' patterns (for example just by random MC simulations up until eye-filling) and train the initial random network with these, given the calculated winner as target output for the value-evaluation. It doesn't add human data, it's simply to teach the network to actually count so that MCTS has some meaning, and in this learn-how-to-capture-stones phase the MCTS playouts will actually guide it toward moves that do lead to increased score (i.e. captures, avoiding capture, etc.). It will figure out dead stones, L&D, etc. later, but it absolutely must learn what a "point" is.

Because I do question the idea of spending weeks of GPU time to do ~600 × 1000 MCTS expansions that are in essence completely random/futile until, by some numerical accident after several hundred thousand games, it actually learns what a point is according to TT rules.
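
For reference, Tromp-Taylor area scoring of a finished position is nothing more exotic than stones plus empty regions bordered by a single colour. A minimal sketch, assuming a plain 19×19 list-of-lists board (not how leelaz actually represents positions):

```python
def tromp_taylor_score(board):
    """board: list of lists holding 'B', 'W' or '.' for empty."""
    size = len(board)
    score = {'B': 0, 'W': 0}
    seen = set()                                  # empty points already assigned to a region
    for y in range(size):
        for x in range(size):
            stone = board[y][x]
            if stone in score:
                score[stone] += 1                 # each stone is a point for its colour
            elif (x, y) not in seen:
                # Flood-fill this empty region and record which colours border it.
                region, borders, stack = set(), set(), [(x, y)]
                while stack:
                    cx, cy = stack.pop()
                    if (cx, cy) in region:
                        continue
                    region.add((cx, cy))
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                        if 0 <= nx < size and 0 <= ny < size:
                            neighbour = board[ny][nx]
                            if neighbour == '.':
                                stack.append((nx, ny))
                            else:
                                borders.add(neighbour)
                seen |= region
                if len(borders) == 1:             # region touches only one colour: it scores
                    score[borders.pop()] += len(region)
    return score['B'], score['W']                 # komi (7.5 here) is then added to white
```

The winner of such a randomly generated final position is then just the sign of black's score minus white's score plus komi, which is the kind of value-head target being proposed above.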

kityanhem commented 6 years ago

What about LZ-62k now? Is it stronger than LZ-19k?

isty2e commented 6 years ago

@fsh I don't think the initialization matters in the long term. As you can see in the paper, the early stage takes only a small fraction of the time when we aim to achieve a high rating (>4000). Otherwise we could've started with a pre-trained network, cheating a bit.

fsh commented 6 years ago

@fsh I don't think the initialization matters in the long term.

Well... First of all I have to say that I am not sure either. That's why I am cautious and qualifying my suggestion. But if you are right (I suspect the same), then...exactly my point. That's why I suggest doing some sort of primitive initialization to get the value-output to be non-random in this super-bootstrap phase.

As you can see in the paper, the early stage spends only a small fraction of the time

Yes, like I quoted, it "only" takes the 20-block AZ roughly 200k games until it's playing for captures -- i.e. it has learned that capturing stones gives a better chance of winning. It's a small fraction, a mere 3 hours for Google, a drop in the ocean, but how long did it take us, with 1000+ people contributing, to generate even 60k games? And it is my hypothesis that this stage is not doing anything but simply nudging random numbers around until by some accident it learns what a "point" is.

Otherwise we could've started with pre-trained network, cheating a bit.

No! Please understand me: this is very different. I am not suggesting a pre-trained network or any sort of go-play data at all, neither from humans (even "dumb human 13-kyus" (wtf dude)) nor from other bots. After all, it could be the case that it is actually important that the network is trained from the ground up, to learn some qualitatively different strategy than what is used by humans (as Google hinted it did). What I was suggesting was merely speeding up this extreme initial stage, i.e. Google's first 3 hours (i.e. our first week or weeks), with random final MC positions (with the policy target set to random move vectors, or indeed ignoring those gradients so as not to bias it toward pass, for example; the only thing that should be targeted is the v-value). For example, generate 200k of these positions through MC: it would kind of be like just auto-simulating those early random games (i.e. with node-expansions=1 instead of node-expansions=1000, because until the value-evaluations cease to be random it doesn't matter)...

This project has generated a lot of excitement, including in me, but I guarantee you a lot of that excitement evaporated when the 39k-network turned out to show no improvement over 50-kyu play. Again, I suspect, because this style of 100% reliance on value-output causes the MCTS to not do anything useful until the value-evaluations are less random. I suspect at this stage, until MCTS is actually doing something useful, what we are doing is simply wasting 99% of our computations (which Google can afford to do when they want to make a point with big fanfare in Nature and it is only a few hours for them).

isty2e commented 6 years ago

@fsh What are we trying to achieve with this project? Presumably a strong Go AI, built without human knowledge. A "strong" AI, I mean. A rough estimate suggests that we need 4 years to reach AG:Z level at this rate. Considering that, spending a month or two is not a big deal. Of course, we might come up with faster hardware, but still, this initialization thing wouldn't matter too much.

And please, that 39k network was just a single step. Are you expecting an ever-improving network? You're going to see a lot of failed networks. Why bother with a single one?

fsh commented 6 years ago

Presumably a strong Go AI, built without human knowledge.

Sure! (But there is another question in there: how strongly are we trying to make a point of it being without human knowledge which Google did for obvious reasons to show off the power of AI. (I.e. the second part of your sentence.) By the way, I agree with it on some level, that's why I suggested what I did. We're using MCTS as a technique to teach it the rules of Go. That's not cheating, is it? I just suggested something along those lines.)

Considering that, spending a month or two is not a big deal.

I agree with you! I am well aware of the scope and ambition. Though my position is simply just that we don't really need to, and we could exploit this current excitement or hype wave to better effect if we didn't spend a month first making a bot that simply knows that a stone is a point, potentially wasting a ton of GPU-cycles. That was my point at the end there. I am sorry if I came across differently. For the record, I am still very excited.

And please, that 39k network was just a single step. Are you expecting an ever-improving network? You're going to see a lot of failed networks. Why bother with single one?

Am I expecting an ever-improving network? Of course not. Will we see a lot of failed networks? Sure. Why bother with a single one? I wanted to post this long before the 38k-network was trained, simply because I thought that all this is unnecessary. I held back because everyone was so excited and there was a storm of spam and suggestions and comments. So even now I feel like apologizing for adding to it. But this is not me panicking and trying to pull some random crazy idea out of my ass because of a single training step failure. Again: it was simply a suggestion to stop wasting so much computing power by following the very letter of the Nature paper, because we presumably do not care so much about Google's primary point; we want to train an actual Go AI (that may be strong and qualitatively different from bots trained on human games), not wage an elaborate PR campaign about AI to the educated public.

evanroberts85 commented 6 years ago

I agree that the MCTS is not proving very useful in the end game, and without learning to count, the early game is meaningless too. I'd suggest upping the number of playouts significantly, either for the whole game or alternatively just near the end; the beginning could for now be either completely random or use just a small number of playouts. This would, I feel, bring much faster progress, even if it meant fewer (but higher quality) games.


evanroberts85 commented 6 years ago

Another idea would be to look through the games to find those where the MCTS has come up with some useful data (i.e. it can clearly see two paths, a losing one and a winning one), then backtrack a few moves and play out another 100 games from that position, hoping to generate a lot more slightly different games that are also very useful for training.

sethtroisi commented 6 years ago

I found a quick way to look at the L1 norm of each row:

perl -anle '$x+=abs($_) for(@F); print $.." ".$x; $x=0;' weights.txt
perl -anle '$x+=abs($_) for(@F); print $x; $x=0;' autogtp/fe3f6afd....
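
The same thing as a small Python sketch, for anyone who would rather not decode the Perl:

```python
# Print the line number and L1 norm (sum of absolute values) of each row
# of a weights file, e.g. weights.txt or the unzipped fe3f6... network.
import sys

with open(sys.argv[1]) as f:
    for lineno, line in enumerate(f, 1):
        print(lineno, sum(abs(float(v)) for v in line.split()))
```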

gcp commented 6 years ago
leelaz-19k v leelaz-62k (116/1000 games)
board size: 19   komi: 7.5
             wins              black         white       avg cpu
leelaz-19k     23 19.83%       14 23.73%     9  15.79%    467.12
leelaz-62k     93 80.17%       48 84.21%     45 76.27%    477.19
                               62 53.45%     54 46.55%
kityanhem commented 6 years ago

Hooray! Is it 30k now? Does it know how to capture a stone?