tensorflow / minigo

An open-source implementation of the AlphaGoZero algorithm
Apache License 2.0

Consider active learning approach #100

Open brilee opened 6 years ago

brilee commented 6 years ago

Instead of training on all game data indiscriminately, use the prior-Q and prior-policy recorded values to deduce how "unexpected" the actual played move was. If we filter for all the interesting moves where MCTS figured something out, then we may be able to accelerate the training process.
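One way to read the proposal, as a rough sketch (the field names, the combined score, and the threshold are all illustrative assumptions, not minigo's actual recorded format):

```python
import numpy as np

def surprise_score(prior_policy, played_move, prior_q, outcome):
    """Combine policy surprise (the prior put little weight on the move MCTS
    actually played) with value surprise (|outcome - prior Q|)."""
    policy_surprise = -np.log(prior_policy[played_move] + 1e-8)
    value_surprise = abs(outcome - prior_q)
    return policy_surprise + value_surprise

def select_interesting(positions, threshold=2.0):
    """Keep only positions where MCTS 'figured something out'."""
    return [p for p in positions
            if surprise_score(p.prior_policy, p.played_move,
                              p.prior_q, p.outcome) > threshold]
```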

Ishinoshita commented 5 years ago

@brilee It has struck me for a while that none of the known projects (AZ, LZ, MG) has tried some sort of "prioritized experience replay", i.e. sampling non-uniformly within the training window.

For example, just to illustrate: keep two buckets of training samples and sample, say, 50%-50% from each of them. The first bucket would contain new positions, plus positions already used for training whose loss at the last inference, versus the training target, was below some loss threshold (low surprise level). The second would contain positions whose loss was above that threshold.

A priori, there would be no risk of overfitting, as positions would be removed from the second bucket after n_max iterations or once the net predicts the position's evaluation better.
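A minimal sketch of the two-bucket idea, assuming a `loss_fn` that scores a position against its stored training target (the names and the threshold are made up for illustration, not minigo code):

```python
import random

LOSS_THRESHOLD = 1.0   # "surprise" cutoff (illustrative value)

def assign_buckets(positions, loss_fn):
    """Split positions by last-inference loss vs. their training target."""
    low_surprise, high_surprise = [], []
    for pos in positions:
        (high_surprise if loss_fn(pos) > LOSS_THRESHOLD else low_surprise).append(pos)
    return low_surprise, high_surprise

def sample_batch(low_surprise, high_surprise, batch_size):
    """Draw a 50%-50% mix from the two buckets."""
    half = batch_size // 2
    return (random.sample(low_surprise, half)
            + random.sample(high_surprise, batch_size - half))
```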

Has anything along these lines been tested, to your knowledge? Thanks.

brilee commented 5 years ago

This general class of idea is called a 'baseline', and existing examples work by subtracting a baseline (I expected to win with 0.95 probability) from the eventual result (I won -> training target = 1). That way, the gradient update from this particular example is much smaller than usual. If the result is unexpected (big divergence from the baseline), it gets a bigger weight in the gradient update.
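In code, the baseline idea is just a subtraction; a toy illustration (not minigo's actual loss code):

```python
def advantage(result, baseline):
    """result: eventual game outcome used as the training target (e.g. 1 for a win).
    baseline: the value the net already predicted for this position (e.g. 0.95).
    The gradient contribution scales with this gap."""
    return result - baseline

print(advantage(1.0, 0.95))  # expected win, won:   0.05 -> tiny update
print(advantage(0.0, 0.95))  # expected win, lost: -0.95 -> large update
```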

alreadydone commented 5 years ago

Isn't "ground truth" more appropriate ...

brilee commented 5 years ago

The word 'baseline' is what you'll want to search to actually find previous literature on the topic.

Ishinoshita commented 5 years ago

@brilee I see what you mean by a baseline. It was used to train the second iteration of the policy net in the original AG paper, IIRC.

But what I was trying to describe is looping more times over 'surprising' positions than over well-predicted ones. Not sure I was clear enough, unless this actually does amount to some baseline technique, in which case I fail to see how. Regards.

brilee commented 5 years ago

I see what you're saying. I think the baseline approach effectively does what you want. The gradients per example are linearly correlated to the final error, so if you train on examples that are "surprising", then the model is naturally training harder towards those examples. Then, it just becomes a question of computational efficiency to drop the examples that are unsurprising because they're not contributing much to the generated gradients. Training more frequently on surprising examples would be like weighting them as x^2 - doubling down on the effect, which may be too much.
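A back-of-the-envelope illustration of the linear-vs-quadratic point, with made-up error values:

```python
errors = [0.1, 0.5, 1.0]

# Uniform sampling: each example's expected gradient contribution ~ its error.
linear_weighting = [e for e in errors]

# Sampling frequency proportional to error, on top of the error-proportional
# gradient, multiplies in a second factor of the error: ~ error**2.
quadratic_weighting = [e * e for e in errors]

print(linear_weighting)     # [0.1, 0.5, 1.0]
print(quadratic_weighting)  # [0.01..., 0.25, 1.0] -- surprising examples dominate far more
```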

Ishinoshita commented 5 years ago

Thanks Brian!

amj commented 4 years ago

KataGo has tried this and it appears to work really well. From lightvector on the Discord:

KataGo is now doing this as of some experiments in just the last couple of months, and this change is definitely going into the end-of-year run (somewhere from a 1.4x to a 1.8x speedup in learning, it looks like).

Highly conveniently, the MCTS search in selfplay needs to compute the policy of the neural net at the root, so at data-writing time you can compute how "surprising" a given training target will be for the neural net. So I have it compute the KL-divergence = policy loss minus intrinsic entropy = E_{m ~ training target distribution}[log(target(m)) - log(prior(m))].

What I do is count how many training samples it would have normally written down for the game (which is all the non-cheap searches, from playout-cap-oscillation) = N. Each position gets assigned a weight of 0.5 as a baseline (total N/2), and then the other half of the weight is distributed on these positions proportional to KL-divergence. Then, each position gets written out in expectation a number of times equal to its weight. For example, a training sample position with weight 1.7 would be written once for sure, and once more with probability 0.7.

On the training side, I don't make any changes - you just shuffle and batch the samples as before; it's just that some of the "surprising" samples are now in the data multiple times to increase the frequency with which the training will see them (and some of the less surprising samples have stochastically been thrown away).

This scheme is almost certainly non-optimal, it was just the first thing I tried. The 0.5 baseline was to be a little conservative in this test and not go too crazy towards fully reweighting things by KL-divergence; instead only half of the weight does. Motivation: the neural net policy improves faster by more frequently being updated on moves that the search determined were good but that the policy found unintuitive.

(Minor additional detail: I also allow the "cheap" playout-cap-randomization searches to participate in the KL-divergence part. They don't get the 0.5 baseline, however, and in fact get a minor additional penalty. It's not so common for a cheap search, with few playouts, to deviate as much from policy. But in the occasional case that it finds something so surprising that it does, the cheap search result will get some weight and a chance to stochastically be recorded for training.)
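A rough reconstruction of the weighting scheme described above, in Python pseudocode (all names and data shapes are assumptions; KataGo's actual data writer is separate code and handles the cheap-search details differently):

```python
import math
import random

def kl_from_prior(target_dist, prior_dist):
    """KL(target || prior) = E_{m ~ target}[log(target(m)) - log(prior(m))],
    i.e. policy loss minus the target's intrinsic entropy."""
    return sum(t * (math.log(t) - math.log(prior_dist[m]))
               for m, t in target_dist.items() if t > 0)

def sample_weights(samples):
    """samples: list of (target_dist, prior_dist) pairs, one per non-cheap search."""
    n = len(samples)
    kls = [kl_from_prior(t, p) for t, p in samples]
    total_kl = sum(kls) or 1.0
    # Half the total weight (N/2) is a flat 0.5-per-sample baseline; the other
    # half is distributed across samples proportional to their KL-divergence.
    return [0.5 + (n / 2.0) * (kl / total_kl) for kl in kls]

def times_written(weight):
    """Write floor(weight) copies for sure, plus one more with probability
    frac(weight); e.g. weight 1.7 -> once for sure, again with probability 0.7."""
    whole = int(weight)
    return whole + (1 if random.random() < weight - whole else 0)
```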

brilee commented 4 years ago

fwiw, instead of adding additional examples to the stream, you can just drop them with probability (1-p) or something. The drawback is that you can't reuse that stream anymore, since it hard-codes one possible sampling.
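A tiny sketch of that dropping variant (an illustrative helper, not part of minigo):

```python
import random

def thin_stream(examples, weights):
    """Keep each example with probability proportional to its weight
    (normalized so the largest weight is always kept)."""
    max_w = max(weights)
    for example, w in zip(examples, weights):
        if random.random() < w / max_w:
            yield example
```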

amj commented 4 years ago

@brilee yeah but we have the nice sample_records binary to do it for us :)

gcp commented 4 years ago

Then, it just becomes a question of computational efficiency to drop the examples that are unsurprising because they're not contributing much to the generated gradients. ...Training more frequently on surprising examples would be like weighting them as x^2 - doubling down on the effect, which may be too much.

I think this is key here. It would make sense for this to be a computational optimization at training time. But if it results in faster learning in a setup that is not GPU-training limited (as opposed to self-play), it fires off all my alarms that something is wrong.

Computing KL-divergence at data-writing time seems equivalent to computing the forward pass loss at training time (which you have to do anyway!). The only difference is that the KL-divergence doesn't know about any improvement the net might already have had, so it should be worse (not considering the computational gain of skipping samples that have almost no loss).

On the other hand, the training time speedup could be pretty nice. Prediction rate for strong networks is like ~64%? That means you could likely skip 2 out of 3 training positions because they contribute almost no loss to the gradient.
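For concreteness, a hedged sketch of what skipping low-loss positions at training time could look like in TF2-style code (illustrative only, not something minigo or LZ does; `loss_fn` returning per-example losses, e.g. a Keras loss with reduction=NONE, is an assumption):

```python
import tensorflow as tf

LOSS_SKIP_THRESHOLD = 0.1  # illustrative cutoff for "contributes almost no loss"

@tf.function
def train_step(model, optimizer, loss_fn, x, y):
    with tf.GradientTape() as tape:
        preds = model(x, training=True)
        per_example_loss = loss_fn(y, preds)                 # shape [batch]
        keep = tf.cast(per_example_loss > LOSS_SKIP_THRESHOLD,
                       per_example_loss.dtype)
        # Only the "surprising" examples contribute to the averaged loss.
        loss = (tf.reduce_sum(per_example_loss * keep)
                / tf.maximum(tf.reduce_sum(keep), 1.0))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

Masking inside the batch only illustrates the selection criterion; an actual compute saving would require repacking batches with the surviving examples before the backward pass.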

lightvector commented 4 years ago

Yes, KataGo is GPU-training-limited to some extent on my setup, hitting each training sample about 2-4 times, but with 8 symmetries to learn.

But this also means it's probably selfplay-limited on the value head. Not that I've actually had the resources to test different ratios all that carefully, but in general I suspect that, because of the intrinsic disparity between the effective amount of policy data and the effective amount of value data, it might in fact be that the optimal train/selfplay balance has the policy head be training-limited and the value head be selfplay-limited. If so, then anything that reduces that disparity is helpful.