ASilver opened this issue 4 years ago:

Is there a place where one can find information on the training details, such as steps, batch sizes, LR schedule, and/or the volume of games used in each iteration?
Sure, I guess you can ask me, and then future people can read this post. :)
For general self-play settings, you can look here: https://github.com/lightvector/KataGo/tree/master/cpp/configs/training
selfplay8a.cfg was used up to shortly after b15c192 started selfplay.
selfplay8b.cfg was used until starting 20 blocks.
selfplay8b20.cfg was used thereafter.
Steps and volume of games (or rather, data rows) are in the names of the nets themselves.
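For example, a name like b15c192-s40750592-d62386709 encodes 15 blocks, 192 channels, the step count so far, and the cumulative data rows so far. Here's a throwaway parsing sketch (just for illustration, not anything from the KataGo codebase):

```python
import re

def parse_net_name(name):
    """Parse a KataGo net name like 'b15c192-s40750592-d62386709'.

    b = residual blocks, c = channels, s = training steps so far,
    d = cumulative selfplay data rows so far (as described above).
    """
    m = re.match(r"b(\d+)c(\d+)-s(\d+)-d(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized net name: {name}")
    blocks, channels, steps, data_rows = (int(x) for x in m.groups())
    return {"blocks": blocks, "channels": channels, "steps": steps, "data_rows": data_rows}

print(parse_net_name("b15c192-s40750592-d62386709"))
# -> {'blocks': 15, 'channels': 192, 'steps': 40750592, 'data_rows': 62386709}
```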
If you're asking about "per iteration", then there is no answer. There is not really such a thing as an "iteration" in KataGo. The entire run proceeds asynchronously with everything running at the same time. Selfplay is always running continuously using the latest net, even switching mid-game if the latest net updates. Training is always running continuously, snapshotting itself to output new nets, without waiting for any particular number of selfplay games.
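As a rough sketch of that asynchronous structure (the threads here just stand in for what are really separate processes, and none of this is KataGo's actual code):

```python
import threading, time

# Toy sketch only: selfplay and training run at the same time, with no notion of an
# "iteration" synchronizing them.

latest_net = "net-snapshot-0"   # whatever net the trainer most recently exported
data_rows = []                  # stands in for the pool of selfplay training data
stop = False

def selfplay_loop():
    # Selfplay always uses the most recent net, switching whenever a new one appears
    # (in the real run, even mid-game).
    while not stop:
        data_rows.append(f"row generated by {latest_net}")
        time.sleep(0.001)

def training_loop():
    # Training runs continuously and periodically snapshots a new net,
    # without waiting for any particular number of selfplay games.
    global latest_net
    step = 0
    while not stop:
        if len(data_rows) >= 256:
            _batch = data_rows[-256:]   # pretend gradient step on a batch of 256
            step += 1
            if step % 1000 == 0:
                latest_net = f"net-snapshot-{step}"
        time.sleep(0.0001)

threads = [threading.Thread(target=selfplay_loop), threading.Thread(target=training_loop)]
for t in threads:
    t.start()
time.sleep(2.0)
stop = True
for t in threads:
    t.join()
print(f"generated {len(data_rows)} rows; latest net: {latest_net}")
```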
The batch size for the entire run has been 256.
The LR schedule started at lr_scale 1.0 (https://github.com/lightvector/KataGo/blob/master/python/model.py#L1662), and lr_scale 1.0 was used for most of the run (there's a tiny, almost negligible bit at the start where it's lower, to limit early large gradient steps). At this point, for the main self-play nets, it's been tapered down pretty gradually to a current value of 0.3. The exact schedule doesn't make a huge difference, I think, at least for gradual changes like this.
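To make the shape of that concrete, here's a sketch; the breakpoints and the early reduced value below are made up for illustration, and only "mostly 1.0, gradually tapering to 0.3 late in the run" reflects what's described above:

```python
def lr_scale(train_step):
    """Illustrative shape of the lr_scale schedule, NOT the real numbers.

    Real run: a slightly reduced scale for a tiny bit at the very start (to limit
    early large gradient steps), 1.0 for most of the run, then a gradual taper
    down to 0.3. The breakpoints below are invented for the example.
    """
    warmup_steps = 250_000        # hypothetical
    taper_start = 150_000_000     # hypothetical
    taper_end = 250_000_000       # hypothetical

    if train_step < warmup_steps:
        return 0.2                # hypothetical reduced early value
    if train_step < taper_start:
        return 1.0                # bulk of the run
    if train_step < taper_end:
        frac = (train_step - taper_start) / (taper_end - taper_start)
        return 1.0 - frac * (1.0 - 0.3)   # gradual taper from 1.0 down to 0.3
    return 0.3
```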
I have a question regarding increasing the block size.
In the link below, when the block size changes, the step count drops while the cumulative amount of data continues to increase. https://d3dndmfyhecmj0.cloudfront.net/g65/models/index.html
[8.4M] b6c96-s103408384-d26419149.zip
[24M] b10c128-s20033280-d26738073.zip
...
[24M] b10c128-s101899520-d60734663.zip
[81M] b15c192-s40750592-d62386709.zip
I think you trained the larger net using the self-play data that had previously been generated (up to a specific point in time) by the smaller net. Is that right?
And I am curious what criteria you used for starting to train the bigger nets. Did you start training with larger blocks when the loss saturated, or when match performance stopped improving?
Thank you.
Good question. I would need to look up what I did for such an old run, but for g104 and g170, with the first size being b6c96, I semi-arbitrarily waited half a day before starting the b10c128, but thereafter every new neural net size was usually started exactly when selfplay switched over to the previous size. In other words, the moment selfplay switched to a given size, training began for the next size up.
Doing so works out nicely if you happen to have exactly two machines for training. The moment you switch is the moment that GPU becomes free, because you no longer need the small net, so that's a natural time to use it to begin the next larger size; your two GPUs are then always occupied training exactly the current size and the next size. Of course, now you can see with both the b30 and b40 nets and the various extended training that I've become a bit more flexible in having different GPUs doing different things, but for a lot of my original and smaller-scale testing I stuck with two GPUs, so this was the natural way to run it, and I've stuck with it since because it seems to work fine.
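To make the two-GPU picture concrete, here's a schematic only (the sizes are examples along the usual progression; the loop is just to show the staggering):

```python
# Schematic of the staggered schedule with exactly two training GPUs:
# while selfplay is using size N, one GPU keeps training size N (producing its
# future snapshots) and the other GPU is already training size N+1.
sizes = ["b6c96", "b10c128", "b15c192", "b20c256", "b30c320"]

for k, current in enumerate(sizes):
    nxt = sizes[k + 1] if k + 1 < len(sizes) else "(whatever comes next / extended training)"
    print(f"selfplay on {current:8s} -> one GPU trains {current:8s}, the other trains {nxt}")
```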