ASilver opened this issue 4 years ago:

Is there a place where one can find information on the training details, such as steps, batch sizes, LR schedule, and/or the volume of games used in each iteration?
Sure, I guess you can ask me, and then future people can read this post. :)
For general self-play settings, you can look here: https://github.com/lightvector/KataGo/tree/master/cpp/configs/training
selfplay8a.cfg was used up to shortly after b15c192 started selfplay.
selfplay8b.cfg was used until starting 20 blocks.
selfplay8b20.cfg was used thereafter.
Steps and volume of games (or rather, data rows) are in the names of the nets themselves.
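For example, a name like b15c192-s40750592-d62386709 encodes 15 blocks, 192 channels, the step count so far, and the cumulative data rows so far. Here's a throwaway parsing sketch (just for illustration, not anything from the KataGo codebase):

```python
import re

def parse_net_name(name):
    """Parse a KataGo net name like 'b15c192-s40750592-d62386709'.

    b = residual blocks, c = channels, s = training steps so far,
    d = cumulative selfplay data rows so far (as described above).
    """
    m = re.match(r"b(\d+)c(\d+)-s(\d+)-d(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized net name: {name}")
    blocks, channels, steps, data_rows = (int(x) for x in m.groups())
    return {"blocks": blocks, "channels": channels, "steps": steps, "data_rows": data_rows}

print(parse_net_name("b15c192-s40750592-d62386709"))
# -> {'blocks': 15, 'channels': 192, 'steps': 40750592, 'data_rows': 62386709}
```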
If you're asking about "per iteration", then there is no answer. There is not really such a thing as an "iteration" in KataGo. The entire run proceeds asynchronously with everything running at the same time. Selfplay is always running continuously using the latest net, even switching mid-game if the latest net updates. Training is always running continuously, snapshotting itself to output new nets, without waiting for any particular number of selfplay games.
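As a rough sketch of that asynchronous structure (the threads here just stand in for what are really separate processes, and none of this is KataGo's actual code):

```python
import threading, time

# Toy sketch only: selfplay and training run at the same time, with no notion of an
# "iteration" synchronizing them.

latest_net = "net-snapshot-0"   # whatever net the trainer most recently exported
data_rows = []                  # stands in for the pool of selfplay training data
stop = False

def selfplay_loop():
    # Selfplay always uses the most recent net, switching whenever a new one appears
    # (in the real run, even mid-game).
    while not stop:
        data_rows.append(f"row generated by {latest_net}")
        time.sleep(0.001)

def training_loop():
    # Training runs continuously and periodically snapshots a new net,
    # without waiting for any particular number of selfplay games.
    global latest_net
    step = 0
    while not stop:
        if len(data_rows) >= 256:
            _batch = data_rows[-256:]   # pretend gradient step on a batch of 256
            step += 1
            if step % 1000 == 0:
                latest_net = f"net-snapshot-{step}"
        time.sleep(0.0001)

threads = [threading.Thread(target=selfplay_loop), threading.Thread(target=training_loop)]
for t in threads:
    t.start()
time.sleep(2.0)
stop = True
for t in threads:
    t.join()
print(f"generated {len(data_rows)} rows; latest net: {latest_net}")
```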
The batch size for the entire run has been 256.
The LR schedule started at lr_scale 1.0 (https://github.com/lightvector/KataGo/blob/master/python/model.py#L1662), and lr_scale 1.0 was used for most of the run (there's a tiny, almost negligible bit at the start where it's lower, to limit early large gradient steps). At this point, for the main self-play nets, it's been tapered down pretty gradually to a current value of 0.3. The exact schedule doesn't make a huge difference, I think, at least for gradual changes like this.
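To make the shape of that concrete, here's a sketch; the breakpoints and the early reduced value below are made up for illustration, and only "mostly 1.0, gradually tapering to 0.3 late in the run" reflects what's described above:

```python
def lr_scale(train_step):
    """Illustrative shape of the lr_scale schedule, NOT the real numbers.

    Real run: a slightly reduced scale for a tiny bit at the very start (to limit
    early large gradient steps), 1.0 for most of the run, then a gradual taper
    down to 0.3. The breakpoints below are invented for the example.
    """
    warmup_steps = 250_000        # hypothetical
    taper_start = 150_000_000     # hypothetical
    taper_end = 250_000_000       # hypothetical

    if train_step < warmup_steps:
        return 0.2                # hypothetical reduced early value
    if train_step < taper_start:
        return 1.0                # bulk of the run
    if train_step < taper_end:
        frac = (train_step - taper_start) / (taper_end - taper_start)
        return 1.0 - frac * (1.0 - 0.3)   # gradual taper from 1.0 down to 0.3
    return 0.3
```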
I have a question regarding increasing the block size.
In the link below, when the block size changes, the step count drops while the cumulative amount of data continues to increase. https://d3dndmfyhecmj0.cloudfront.net/g65/models/index.html
[8.4M] b6c96-s103408384-d26419149.zip
[24M] b10c128-s20033280-d26738073.zip
...
[24M] b10c128-s101899520-d60734663.zip
[81M] b15c192-s40750592-d62386709.zip
I think you trained the larger net using the self-play data that had previously been generated (up to a specific point in time) by the smaller net. Is that right?
And I am curious what criteria you used for starting to train the bigger nets. Did you start training with larger blocks when the loss saturated, or when match performance stopped improving?
Thank you.
Good question. I would need to look up what I did for such an old run, but for g104 and g170, with the first size being b6c96, I semi-arbitrarily waited half a day before starting the b10c128, but thereafter every new neural net size was usually started exactly when selfplay switched over to the previous size. In other words, the moment selfplay switched to a given size, training began for the next size up.
Doing so works out nicely if you happen to have exactly two machines for training. The moment you switch is the moment that GPU becomes free, because you no longer need the small net, so that's a natural time to use it to begin the next larger size; your two GPUs are then always occupied training exactly the current size and the next size. Of course, now you can see with both the b30 and b40 nets and the various extended training that I've become a bit more flexible in having different GPUs doing different things, but for a lot of my original and smaller-scale testing I stuck with two GPUs, so this was the natural way to run it, and I've stuck with it since because it seems to work fine.
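To make the two-GPU picture concrete, here's a schematic only (the sizes are examples along the usual progression; the loop is just to show the staggering):

```python
# Schematic of the staggered schedule with exactly two training GPUs:
# while selfplay is using size N, one GPU keeps training size N (producing its
# future snapshots) and the other GPU is already training size N+1.
sizes = ["b6c96", "b10c128", "b15c192", "b20c256", "b30c320"]

for k, current in enumerate(sizes):
    nxt = sizes[k + 1] if k + 1 < len(sizes) else "(whatever comes next / extended training)"
    print(f"selfplay on {current:8s} -> one GPU trains {current:8s}, the other trains {nxt}")
```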