glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Training: Automate hyperparameter tuning #324

Error323 opened 6 years ago

Error323 commented 6 years ago

Training new neural networks currently involves making decisions given the following parameters [1]:

We then observe the results in the form of value and policy loss, as shown at http://training.lczero.org. As some of you may know, this is tricky to do well. It would be both interesting and useful to try to find optimal parameters through an optimization method such as Gaussian Processes [2].

@kiudee has a lot of experience in the latter and we've discussed this a little already, but I thought it would be good to make this a public effort where everyone can share their thoughts. To start off the discussion, I thought we could maybe try to learn the optimal learning rate given the following input:

We also have the challenge of an underlying data distribution that changes as new nets generate new data. This may mean that a net that looks amazing in terms of policy and value loss has actually overfitted on the current training window (even with a properly separated train and test set within that window, as we have now).
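To make the Gaussian Process idea concrete, here is a minimal sketch using scikit-optimize's `gp_minimize`, which fits a GP surrogate to past (learning rate, loss) observations and proposes the next candidate. `train_and_evaluate` is a hypothetical stand-in for one full training run on the current window, returning test value MSE plus policy loss; here a synthetic curve with a minimum near lr = 1e-3 replaces that expensive run so the snippet is runnable:

```python
import math

from skopt import gp_minimize
from skopt.space import Real

def train_and_evaluate(params):
    (lr,) = params
    # Placeholder: in reality this would train a net on the current window
    # with learning rate `lr` and return test value MSE + test policy loss.
    return (math.log10(lr) + 3.0) ** 2

result = gp_minimize(
    func=train_and_evaluate,
    # Search the learning rate on a log scale, the natural scale for it.
    dimensions=[Real(1e-5, 1e-1, prior="log-uniform", name="learning_rate")],
    n_calls=20,          # each call is a full (expensive) training run
    random_state=0,
)
print("best learning rate:", result.x[0], "with loss:", result.fun)
```

Since each evaluation is a full training run, the budget of calls has to stay small; that sample efficiency is exactly what makes a GP surrogate attractive over grid search here.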

[1] For a list of what (some of) these terms mean, see https://developers.google.com/machine-learning/glossary/.
[2] http://www.gaussianprocess.org/gpml/

jkiliani commented 6 years ago

Would it make sense to consider both the (currently hardcoded) L2 regularisation parameter, and the policy and value loss weighting in the total loss term as potentially tuneable parameters?
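For reference, a minimal sketch of how those weights would look as explicit hyperparameters in the total loss. The names and default values are illustrative, not the actual variables in the leela-chess training code:

```python
# Illustrative only: the (currently hardcoded) weights made explicit as
# tunable hyperparameters of the combined training loss.
def total_loss(policy_ce, value_mse, l2_norm,
               policy_weight=1.0, value_weight=1.0, l2_weight=1e-4):
    return (policy_weight * policy_ce    # policy head cross-entropy
            + value_weight * value_mse   # value head mean squared error
            + l2_weight * l2_norm)       # L2 regularisation term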

kiudee commented 6 years ago

Then how would you evaluate the performance of one parameter configuration? For now we were planning to evaluate on test MSE and policy loss.


isty2e commented 6 years ago

Considering the cost of training and evaluating neural networks, wouldn't it be better to reduce the number of hyperparameters being optimized? And I guess evaluating on the loss function while changing the training window size can be somewhat dangerous. IMHO a good starting point would be optimizing (learning rate, game input rate) or so.

Error323 commented 6 years ago

I agree, locking as many variables as possible is a good idea, as the state space explodes otherwise. That said, I think there is a strong correlation between window size and game input rate. But we might indeed want to start by fixing the window size too, or by defining it as a percentage of new data, i.e. let the window be of a length L such that it contains 20% new data.

edit: OTOH everything is correlated and I don't know :sweat_smile:

isty2e commented 6 years ago

Does LCZero support batching that exceeds VRAM? If so, optimizing only batch size and learning rate is probably better (and common too).
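For context: training with a nominal batch larger than what fits in VRAM is typically done via gradient accumulation, i.e. averaging gradients over micro-batches before a single optimizer step. A toy sketch of the pattern; the quadratic "loss" stands in for the real network, and nothing here is LCZero code:

```python
import numpy as np

# A nominal batch of 1024 is processed as 8 micro-batches of 128 that each
# fit in memory; the averaged gradient drives a single update. The loss here
# is 0.5 * ||theta - batch_mean||^2, whose gradient is theta - batch_mean.
rng = np.random.default_rng(0)
theta = np.zeros(4)                    # stand-in for network weights
data = rng.normal(size=(1024, 4))      # one nominal (too-large) batch

grad_sum = np.zeros_like(theta)
for batch in np.split(data, 8):        # micro-batches that fit in VRAM
    grad_sum += theta - batch.mean(axis=0)   # gradient on this micro-batch
grad = grad_sum / 8                    # equals the full-batch gradient
theta -= 0.1 * grad                    # one step for the whole nominal batch
```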

jkiliani commented 6 years ago

Defining window size by percentage of new data sounds risky, since a reduction in active clients would automatically shrink the window, which could really overfit the value head.

I'm not sure whether making the window dependent on measured progress, i.e. self-play Elo, would work well, but making it proportional to new data without also capping it from below sounds like something that could easily go wrong.
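A minimal sketch combining the two proposals, i.e. a window length derived from the new-data fraction Error323 suggests, with the lower cap asked for above. All constants are illustrative, not project values:

```python
def window_size(new_games, target_fraction=0.20, min_window=250_000):
    """Window length L such that new games make up target_fraction of L,
    clamped from below so a drop in active clients cannot shrink the
    window arbitrarily."""
    return max(min_window, int(new_games / target_fraction))

print(window_size(100_000))  # 500_000: 100k new games are 20% of the window
print(window_size(10_000))   # 250_000: the lower cap takes over
```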

ThomasCabaret commented 6 years ago

Hi, I think an important point is to have continuous hyperparameters from one training session to the next: if something is dynamically changed during a training session, the next session should be initialized with the hyperparameter values from the end of the previous one. This would avoid destructive behavior when training sessions start, and would let us spot the cause of any disruption more easily. (I am not sure right now whether the disruptions we see come from different parameters at the start of a training session, or just from the new data content.)
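What this could look like as a minimal sketch, assuming a simple JSON state file; the filename, fields, and defaults are hypothetical. Each session resumes from the values the previous session ended with:

```python
import json
import os

STATE_FILE = "hyperparams.json"  # hypothetical location
DEFAULTS = {"learning_rate": 0.01, "batch_size": 512}  # illustrative values

def load_hyperparams():
    # Resume from the values the previous session ended with, if any.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return dict(DEFAULTS)

def save_hyperparams(params):
    # Called at the end of a session so the next one starts from here.
    with open(STATE_FILE, "w") as f:
        json.dump(params, f, indent=2)
```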

Error323 commented 6 years ago

Hi @ThomasCabaret

> Hi, I think an important point is to have continuous hyperparameters from one training session to the next: if something is dynamically changed during a training session, the next session should be initialized with the hyperparameter values from the end of the previous one.

This is a fair point. The idea now would be to run a separate training instance, performed independently, so we can observe the training behavior without using it directly per se.

> (I am not sure right now whether the disruptions we see come from different parameters at the start of a training session, or just from the new data content.)

I think it's because of the switch to the new 128x10 architecture and the two neural net input bug fixes.

@jkiliani

> Defining window size by percentage of new data sounds risky, since a reduction in active clients would automatically shrink the window, which could really overfit the value head.

Agreed.

@isty2e

> Does LCZero support batching that exceeds VRAM? If so, optimizing only batch size and learning rate is probably better (and common too).

Not yet, but this is something to think about.