iopq opened this issue 4 years ago
It seems that for training, you need a decent sample size of at least 200,000 rows, maybe even more, to prevent overfitting to the data.
The default setting of 1M samples per epoch is maybe a bit much. I train three times (sometimes with a lower learning rate, so it's not linear), and I like to use about as many samples as I have rows, or up to twice as many, so 200K-400K samples for my loop.
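To make that rule of thumb concrete, here is a minimal sketch (the helper is hypothetical, not part of KataGo's training scripts):

```python
def samples_for_loop(num_rows):
    """Rule of thumb: train on roughly 1x-2x as many samples as you have
    rows of data, rather than the default 1M samples per epoch."""
    return num_rows, 2 * num_rows  # (low, high) sample budget for one loop

# With ~200K rows of data:
print(samples_for_loop(200_000))  # -> (200000, 400000), i.e. 200K-400K samples
```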
You can tell by watching how much pacc1 resets lower each time a new loop of data comes in: that difference is overfitting.
For example, if pacc1 is 60% at the start, climbs to 75%, and then drops back to 60% on the next set of data, I'm doing too much training or I don't have enough data.
It should go up to something like 62% and drop back to 60% on the next set of data; that represents a realistic "data got more difficult" change.
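As a rough sketch of that sanity check (the helper and the 0.05 threshold are just illustrative; pacc1 is the policy top-1 accuracy the training already logs):

```python
def looks_overfit(pacc1_history, max_peak_drop=0.05):
    """Flag a loop as overfitting when pacc1 climbs far above where it lands
    once the next set of data arrives."""
    return max(pacc1_history) - pacc1_history[-1] > max_peak_drop

print(looks_overfit([0.60, 0.75, 0.60]))  # True: too much training / too little data
print(looks_overfit([0.60, 0.62, 0.60]))  # False: realistic "data got harder" dip
```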
I left my computer on while on vacation to generate games with the new version. Unfortunately, KataGo got OOM-killed, so I only had 25K games to work with. This is 9x9 only, with 100 cheap visits and 800 expensive visits. All networks are 20b256c.
My first attempt was to train on the 25K games, but the network didn't seem to gain much strength after a dozen networks at 1x learning rate. I cut the learning rate to 0.1x, then 0.03x, and it was still much weaker than the older version.
Then I generated 25K more games, trained the same way, tested, and it wasn't good enough. I repeated this process until I had around 120K games (a little over 2 million rows), but it never got above a 15% win rate vs. my best old net.
I restarted, because I figured it would be better to use those 120K games to train a new network from scratch, sampling all the old games equally (since they all come from the same network anyway). I again quickly hit diminishing returns at 1x learning rate, so instead I did this:
I noticed that the higher learning rate passes actually increase the loss, but by the time I get down to 0.01x the loss drops to a new minimum.
I kept cycling like this and kept getting gains, but after 2 days I figured it was enough - I don't want to overfit the 120K games either.
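Roughly, the cycle looks like this (just a sketch; the exact multipliers are approximate, and I was restarting training runs by hand rather than running a loop like this):

```python
# "Circular" learning rate schedule: step the LR multiplier down within a
# cycle, then reset to 1.0x and repeat over the same ~2M rows of data.
LR_CYCLE = [1.0, 0.1, 0.03, 0.01]

def lr_schedule(num_cycles):
    """Yield (cycle, lr_multiplier) pairs; the lowest step of each cycle is
    where the loss reaches a new minimum."""
    for cycle in range(num_cycles):
        for mult in LR_CYCLE:
            yield cycle, mult

for cycle, mult in lr_schedule(num_cycles=3):
    print(cycle, mult)
    # train_for(samples=300_000, lr_multiplier=mult)  # hypothetical helper
```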
The first attempt had 30 generated networks. The second attempt had 53. The old version had 117 networks at 20b.
I tested this second network against my first attempt and it passed the gatekeeper 100.5 - 17.5. Training with this circular learning rate schedule actually made my network much stronger!
Against the old version it lost 94-100, but that's close enough to a tie to call it a success. In other words, you can use as few as 2 million rows of data to train a superhuman AI that's just as good as one built up gradually over a million games (tens of millions of rows), starting from weaker networks.