lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

Training details or info? #222

Open ASilver opened 4 years ago

ASilver commented 4 years ago

Is there a place where one can find information on the training details, such as steps, batch sizes, LR schedule, and/or the volume of games used in each iteration?

lightvector commented 4 years ago

Sure, I guess you can ask me, and then future people can read this post. :)

isseebx123 commented 4 years ago

I have a question regarding increasing the block size.

In the model list linked below, whenever the block size increases, the step count (the s number) drops back down while the cumulative data count (the d number) keeps growing. https://d3dndmfyhecmj0.cloudfront.net/g65/models/index.html

[8.4M] b6c96-s103408384-d26419149.zip
[24M] b10c128-s20033280-d26738073.zip
...
[24M] b10c128-s101899520-d60734663.zip
[81M] b15c192-s40750592-d62386709.zip
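
Not KataGo code, but here is a rough sketch of how I'm reading those filenames, assuming the usual convention that b = residual blocks, c = channels, s = training samples seen by the net, and d = cumulative selfplay data rows:

```python
import re

# Hypothetical helper, purely for illustration: decode a released model filename
# such as "b10c128-s20033280-d26738073.zip" into its components.
def parse_model_name(name: str) -> dict:
    m = re.match(r"b(\d+)c(\d+)-s(\d+)-d(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized model name: {name}")
    blocks, channels, samples, data_rows = map(int, m.groups())
    return {
        "blocks": blocks,          # network depth (residual blocks)
        "channels": channels,      # network width (channels per block)
        "train_samples": samples,  # samples the net has trained on so far
        "data_rows": data_rows,    # selfplay data rows generated so far
    }

print(parse_model_name("b10c128-s20033280-d26738073.zip"))
```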

I think you started training the larger net on the self-play data that had already been generated up to that point by the smaller net. Is that right?

I am also curious what criteria you used to decide when to start training the bigger nets. Did you start training a larger net once the loss saturated or once match performance stopped improving?

Thank you.

lightvector commented 4 years ago

Good question. I would need to look up what I did for such an old run, but for g104 and g170, with the first size being b6c96, I semi-arbitrarily waited half a day before starting the b10c128. Thereafter, every new neural net size was usually started exactly when selfplay switched to the previous size. In other words: b15c192 training began the moment selfplay switched from b6c96 to b10c128, the next size began the moment selfplay switched to b15c192, and so on.

Doing so works out nicely if you happen to have exactly two machines for training. The moment selfplay switches is precisely when a GPU becomes free, because you no longer need the small net, so that's a natural time to use that GPU to begin the next larger size. Your two GPUs are then always occupied training exactly the current size and the next size.

Of course, as you can see with the b30 and b40 nets and the various extended training, I've since become a bit more flexible about having different GPUs do different things, but for a lot of my original and smaller-scale testing I stuck with two GPUs, so this was the natural way to run it, and I've kept doing it since because it seems to work fine.
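
If it helps to see the bookkeeping, here is a minimal sketch of that rotation. This is not anything from the KataGo repo, just illustrative Python with a made-up run_schedule helper; the size list and the number of switches are stand-ins:

```python
# Illustrative only: two GPUs, one always training the size selfplay currently
# uses, the other training the next size up. When selfplay switches sizes, the
# GPU that was training the now-obsolete smaller net starts the next larger one.
sizes = ["b6c96", "b10c128", "b15c192", "b20c256", "b30c320"]

def run_schedule(num_switches: int) -> None:
    current = 0  # index of the size selfplay is currently using
    gpus = {"gpu0": sizes[current], "gpu1": sizes[current + 1]}
    print("start:", gpus)
    for _ in range(num_switches):
        current += 1  # selfplay switches to the next size
        # The GPU training the now-obsolete smaller size is freed...
        freed = next(g for g, s in gpus.items() if s == sizes[current - 1])
        # ...and immediately begins the next larger size, if there is one.
        if current + 1 < len(sizes):
            gpus[freed] = sizes[current + 1]
        print(f"selfplay -> {sizes[current]}:", gpus)

run_schedule(num_switches=3)
```

At every point the two GPUs hold exactly the current size and the next size, which is the property described above.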