glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess ** A chess adaption of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Oversampling of positions per game in training #590

Open theseapples opened 6 years ago

theseapples commented 6 years ago

Everyone seems fixated on the lack of progress and the need to change every known possible parameter, but what is troubling is that no one is addressing the issue of oversampling of positions per game.

By my own calculation, and multiple independent calculations by members of the Discord, we are oversampling positions per game by a factor of 24 over Alpha Zero. Given this project's strict adherence to the DeepMind philosophy, and the goal of reproducing it, we should fix the training pipeline so that our sampling of positions per game is within some reasonable range of Alpha Zero's.

I define the sampling of positions per game by the formula: steps × batchsize / games.

Samples per game for Alpha Zero: 700,000 × 4,096 / 44,000,000 ≈ 65 positions sampled per game.

Samples for Leela Zero, assuming a 500k window and 33k games per network:

steps × (proportion of one network's games in the window) × batchsize × (number of training cycles those games remain in the window) / games of one network = steps × batchsize / games per network

50,000 × (33,000 / 500,000) × 1,024 × (500,000 / 33,000) / 33,000 = 50,000 × 1,024 / 33,000 ≈ 1,551 positions sampled per game

We're currently oversampling positions in each game by a factor of 24 over Alpha Zero. With an average game length of 145 ply, we're sampling each position 10.7 times, versus 0.45 times for Alpha Zero at the same game length. Since Alpha Zero enabled resign, its average games were shorter, so its 65 samples per game probably works out closer to sampling each position 0.75 times.

The possible corrections to the training pipeline are to keep the games per network constant and either decrease the steps, decrease the batch size, or do a combination of both.

If we decrease the steps to 4,000 and keep the networks at 33,000 games (5-hour intervals at the moment), our samples per game would be 124 (4,000 × 1,024 / 33,000); with an average game length of ~145, each position is sampled less than once on average.
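
The arithmetic above is easy to check; here is a minimal sketch of it (the numbers are just the ones quoted in this thread, not values read out of the actual pipeline):

```python
def samples_per_game(steps, batch_size, games):
    """Average number of positions sampled from each game in the window."""
    return steps * batch_size / games

avg_ply = 145  # average game length assumed in this thread

for name, steps, batch, games in [
    ("Alpha Zero",       700_000, 4096, 44_000_000),  # ~65 samples/game, ~0.45 per position
    ("Leela (current)",   50_000, 1024,     33_000),  # ~1551 samples/game, ~10.7 per position
    ("Leela (proposed)",   4_000, 1024,     33_000),  # ~124 samples/game, <1 per position
]:
    spg = samples_per_game(steps, batch, games)
    print(f"{name}: {spg:.0f} samples/game, {spg / avg_ply:.2f} samples/position")
```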

Until we address this issue I would be highly skeptical of the continuing effort to change other parameters when it’s a known fact that we’re currently oversampling positions per game and the tensorboard has shown divergence for quite some time.

theseapples commented 6 years ago

I trained a 20x256 net with ~100 positions sampled per game on v.8+ data (games 11,200,000+) with a window of 500k games, and it didn't exhibit any of the problems the current net is exhibiting. While the high lr (.01) might not have been capable of fitting enough to pick up the underlying bug, it gave a higher policy-head probability than any previous net in the position outlined in the pastebin below.

https://pastebin.com/wFrLVm50

Weights file: https://drive.google.com/open?id=1zW1527W-IfwvjNXZ9mSOMJeYVblUesU-

go nodes 1
info string Rxd7 -> 2 (V: 80.29%) (N: 98.56%) PV: Rxd7 Bh3 Rd6
info string stm White winrate 75.41%

Thus, to fix this problem, the following constraint must be enforced somewhere in the training pipeline:

(batchsize * steps per net / games per net) < avg moves per game
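
As a sketch of where such a check could live (the function name and the 145-ply default are hypothetical, not part of the actual training code):

```python
def assert_not_oversampling(batch_size, steps_per_net, games_per_net, avg_moves_per_game=145):
    """Refuse a training schedule that would sample each position more than once on average."""
    samples_per_game = batch_size * steps_per_net / games_per_net
    if samples_per_game >= avg_moves_per_game:
        raise ValueError(f"oversampling: {samples_per_game:.0f} samples/game "
                         f">= {avg_moves_per_game} average moves/game")

assert_not_oversampling(1024, 4_000, 33_000)     # ~124 samples/game: OK
# assert_not_oversampling(1024, 50_000, 33_000)  # ~1551 samples/game: would raise
```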

Sadly, server-side gating won't solve the oversampling problem unless the gating restarts training from the previous best net and destroys all the steps to the failed net. Otherwise gating will only move us from an equal oversampling of all nets to an even larger oversampling of the current net.

killerducky commented 6 years ago

@logosg are you able to train your own 15x196 net? Then we can do 2 experiments in parallel.

Edit:

> gating restarts the training from the previous best net and destroys all the steps to the failed net.

This is what LZGo does.

theseapples commented 6 years ago

I can, but if it's bad data in, it will be bad data out. If the data the official net is producing is getting worse, then whatever parameters you choose in training, a net piggybacking off that data will eventually get worse too.

And it's good to know that LZGo gates that way, as it should prevent overfitting. Thanks.

dubslow commented 6 years ago

Absolutely fantastic work, logosg. I look forward to that GPU of yours doing further experiments in the future.

As for the Rxd7 blunder, that is also fixed in current networks, >97% prior in id285. Can you find another, more general way to compare strength with current 28x-era networks? Run a match between them?

fli commented 6 years ago

I thought this was the primary issue, but thinking about it more, is oversampling really a problem? Isn't this the same thing as training for many epochs, which is done a lot in image classification? I believe they get around any possible overfitting from many epochs by using early stopping or sufficient regularisation.

Sampling less might solve the problem, but it shouldn't be a problem with sufficient regularisation.

I think having completely separate test data is more important to try first; it looks like we are training on test data, since it is selected randomly each training cycle. https://github.com/glinscott/leela-chess/issues/595

EDIT: If the goal is to replicate the Alphazero paper's parameters, then I agree that sampling less is the solution. Otherwise I would think about changing regularisation somehow.

jjoshua2 commented 6 years ago

It's pretty common in ML to use all the data for each step, and it mostly works the same, just slower.

Let's take the two extremes. On one side, each gradient descent step uses the entire dataset: you're computing the gradients for every sample. In this case you know exactly the best direction towards a local minimum, so you don't waste time going the wrong way. In terms of the number of gradient descent steps, you'll get there in the fewest.

Of course computing the gradient over the entire dataset is expensive. So now we go to the other extreme. A batch size of just 1 sample. In this case the gradient of that sample may take you completely the wrong direction. But hey, the cost of computing the one gradient was quite trivial. As you take steps with regard to just one sample you "wander" around a bit, but on the average you head towards the same local minimum as in full batch gradient descent.

This might be a moment to point out that I have seen some literature suggesting that this bouncing around that 1-sample stochastic gradient descent does might help you bounce out of a local minimum that full batch mode wouldn't avoid, but that's debatable. The general literature suggests that both converge to the same place in most cases.

In terms of computational power, while the single-sample stochastic GD process takes many many more iterations, you end up getting there for less cost than the full batch mode, "typically." This is how Andrew Ng puts it.

Now let's find the middle ground. We might realize that modern BLAS libraries make computing vector math quite efficient, so computing 10 or 100 samples at once, presuming you've vectorized your code properly, will be barely more work than computing 1 sample (you gain memory-call efficiencies as well as computational tricks built into most efficient math libraries). And averaging over a batch of 10, 100, or 1000 samples is going to produce a gradient that is a more reasonable approximation of the true, full batch-mode gradient. So our steps are now more accurate, meaning we need fewer of them to converge, and at a cost that is only marginally higher than single-sample GD.
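
As a toy illustration of that point (a plain least-squares problem in numpy, nothing to do with the actual pipeline), the mini-batch gradient gets closer to the full-batch gradient as the batch grows:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))
true_w = rng.normal(size=8)
y = X @ true_w + 0.1 * rng.normal(size=10_000)
w = np.zeros(8)  # evaluate gradients at an untrained starting point

def mse_grad(Xb, yb, w):
    """Gradient of mean squared error over the batch."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full = mse_grad(X, y, w)  # the "true" full-batch gradient
for batch in (1, 10, 100, 1000):
    idx = rng.choice(len(X), size=batch, replace=False)
    approx = mse_grad(X[idx], y[idx], w)
    # relative error of the mini-batch estimate vs the full-batch gradient
    print(batch, np.linalg.norm(approx - full) / np.linalg.norm(full))
```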

Optimizing the exact size of the mini-batch you should use is generally left to trial and error. Run some tests on a sample of the dataset with numbers ranging from say tens to a few thousand and see which converges fastest, then go with that. Batch sizes in those ranges seem quite common across the literature. And if your data truly is IID, then the central limit theorem on variation of random processes would also suggest that those ranges are a reasonable approximation of the full gradient.

Deciding exactly when to stop iterating is typically done by monitoring your generalization error on a held-out validation set and choosing the point at which validation error is at its lowest. Training for too many iterations will eventually lead to overfitting, at which point your error on the validation set will start to climb. When you see this happening, back up and stop at the optimal point.
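
A bare-bones sketch of that stopping rule (`train_step` and `validation_loss` are placeholders for whatever the real pipeline exposes, not existing functions):

```python
def train_with_early_stopping(train_step, validation_loss, max_steps,
                              eval_every=1_000, patience=5):
    """Run train_step() until validation loss stops improving for `patience` evaluations."""
    best, evals_since_best = float("inf"), 0
    for step in range(1, max_steps + 1):
        train_step()
        if step % eval_every == 0:
            loss = validation_loss()
            if loss < best:
                best, evals_since_best = loss, 0
            else:
                evals_since_best += 1
                if evals_since_best >= patience:
                    break  # roll back to the checkpoint saved when `best` was set
    return best
```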

On Sun, May 13, 2018, 11:21 AM Francis notifications@github.com wrote:

I thought this was the primary issue, but thinking about this more, is oversampling really a problem? Isn't this the same thing as training for many epochs, which is done a lot in image classification. I believe they get around any possible overfitting from many epochs by using early stop or using sufficient regularisation.

Sampling less might solve the problem, but it shouldn't be a problem with sufficient regularisation.

I think having completely separate test data is more important to try first, it looks like we are training with test data since it is selected randomly each training cycle. #595 https://github.com/glinscott/leela-chess/issues/595

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/glinscott/leela-chess/issues/590#issuecomment-388634677, or mute the thread https://github.com/notifications/unsubscribe-auth/AO6INKZjjF6IHVuHfZR-gsdXGuyA1jqLks5tyE-FgaJpZM4T8h_2 .

jjoshua2 commented 6 years ago

https://arxiv.org/abs/1711.00489 This is a good article about decaying learning rate vs increasing batchsize. Instead of cyclical learning rate we could try a smaller batchsize to allow gradient steps in the wrong direction to get out of minima. And then increase batchsize for the fine tuning.
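
A rough sketch of what swapping learning-rate decay for batch-size growth could look like (the milestones and factor here are made up for illustration; the paper's point is only that multiplying the batch size plays a similar role to dividing the learning rate):

```python
def lr_and_batch(step, base_lr=0.01, base_batch=1024,
                 milestones=(100_000, 200_000, 300_000)):
    """Keep the learning rate fixed and grow the batch size at each milestone,
    instead of dividing the learning rate (per arXiv:1711.00489)."""
    factor = sum(step >= m for m in milestones)
    return base_lr, base_batch * 5 ** factor

print(lr_and_batch(50_000))    # (0.01, 1024)
print(lr_and_batch(250_000))   # (0.01, 25600)
```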

theseapples commented 6 years ago

I retrained the 20x256 net with the v.8 (11.2-11.7m) games, a window of 500k, an lr of .002 for 150,000 steps and .0002 for 50,000 steps, and a batch size of 256. This equates to 102 positions sampled per game. I didn't see a high divergence between the train and test mse, and the accuracy increased to 68.7% from 59% for the previous .01-lr net. The pastebin position did lose some probability in the policy head.
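
(For reference, the 102 figure follows from the same formula as in the opening post: 256 × (150,000 + 50,000) steps / 500,000 games ≈ 102 positions sampled per game.)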

step 1950000, lr=0.0002 policy=1.70841 mse=0.108837 reg=0.117976 total=1.93522 (1030.69 pos/s)
step 1950000, policy=1.71022 training accuracy=68.7272%, mse=0.107988

info string Rxd7 -> 2 (V: 82.66%) (N: 96.53%) PV: Rxd7 Bh3 Rc7
info string stm White winrate 80.27%

New weights: https://drive.google.com/file/d/1cIsS6wgij5GPQXBYGiZXa5yfEgx-1WH2/view

mooskagh commented 6 years ago

https://docs.google.com/spreadsheets/d/12VI4_Ho1tLoGFQVOH3ixly7b1FE8oc6ZXl5Wj2yI7zc/edit#gid=0

Weight distribution of the first convnet: the rule50 plane is still a bit large, but comparable with the other "important" planes. It's also the only plane whose average is pretty far from zero.

The color plane is pretty prominent too.

amj commented 6 years ago

@jjoshua2 your theory is sound but you're misunderstanding how the pipeline works.

For standard supervised learning (SL), looking at every data point makes sense and you can take the optimum gradient descent. No arguments there.

But for LCZ, the issue is that your training iterations repeat, and most of the games are re-used. If the window is 500k games, and only 20k new games are played before the next training, then only 4% of the data is new, meaning that you'll train on the same game 25 times.

For the value head, it's even worse. Instead of 500k games × 100 positions/game = 50M unique datapoints, there are only 500k datapoints.

So the value head is actually seeing only 500k unique datapoints, and it's seeing them 100x too often.

For AGZ, the numbers in the paper for window size, games, etc. seem to work out so that the network sees each position only once.
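
Restating those numbers as a back-of-the-envelope calculation (using the figures quoted in this comment, not measured values):

```python
window_games = 500_000
new_games_per_cycle = 20_000
positions_per_game = 100

reuse_factor = window_games / new_games_per_cycle     # 25: each game is trained on ~25 times
policy_targets = window_games * positions_per_game    # ~50M distinct (position, policy) pairs
value_targets = window_games                          # only ~500k distinct game results
print(reuse_factor, policy_targets, value_targets)
```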

amj commented 6 years ago

(This "how many unique results are there for the value net" is why, in the AGZ paper, they describe training a supervised network on pro games by weighting the value component of the loss function by 0.01 -- it prevents the network from over-learning from multiple positions of the same game)

ASilver commented 6 years ago

I tried testing the 20x NN but cannot get LCZero to even tune for it, much less run it. I unzipped and named it weights.txt. Am I missing something?

peterjacobi commented 6 years ago

Do I have a huge misunderstanding in this:

500k games ago is about net 300. In our current training regime, over these 500k games we got to net 315 by training 15 times on a 500k-game window, of which only 33k games were new each time. Together with the other parameters, that is the dreaded oversampling problem; the exact calculations are at the top of this thread.

Now, if oversampling isn't just an abstract problem but a real problem linked to the playing strength of the training result, it should have given better results to train net 300 only once, by waiting until 500k games by it had accumulated and training on those.

We cannot do this exact test, as we don't have 500k games played by net 300. Hoping that the difference is only a second-order effect, we can put this to the test by using 500k games that really exist, played by nets 300 to 314. Will just one training run on these make a stronger net than net 315? My gut feeling is that I can't believe it would, but feelings are not evidence.