mbabigian opened this issue 6 years ago
Please reconsider adding hand-coded time management algorithms and instead make time allocation a parameter the network must tune itself.
How do you propose we do such a thing?
I like these ideas, however...
The NN does two things: it calculates a policy (where to play) and a value (how good the position is). The search (MCTS) then takes over and tells the network which position to evaluate next.
As you can see, the network has no control over how deep a position will be searched or how much time is spent on it. Time left isn't even an input to the network, and making it one would make no sense: running the NN is expensive, so you only want to do it a few times. For example, with 800 playouts you run the NN 800 times. When a position is evaluated again by MCTS (it always starts off by evaluating the root), the previous NN output is reused. That previous output would contain the time left at the moment it ran, not the current time.
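To make the caching point concrete, here is a minimal sketch; the node structure and names are made up for illustration, not lc0's actual code:

```python
# Minimal sketch (hypothetical node structure) of how MCTS caches
# the single NN call made per node.

class Node:
    def __init__(self, position):
        self.position = position
        self.nn_policy = None  # set once, by the one NN call for this node
        self.nn_value = None
        self.visits = 0

def evaluate(node, net, time_left):
    # The NN runs at most once per node; with 800 playouts that is
    # at most 800 NN calls for the whole search.
    if node.nn_policy is None:
        node.nn_policy, node.nn_value = net(node.position, time_left)
    # Every later MCTS visit reuses the cached output -- including
    # whatever (now stale) time_left value it was computed with.
    return node.nn_value
```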
Time management using MCTS is possible. You can use less time on positions that are certainly won, or when the top pick is clearly better than the alternatives. You can also end MCTS early once the picked move can never change again, or discard moves that can never be picked (because they have too many or too few playouts left to be changed to or from).
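A sketch of the last two ideas, assuming a fixed playout budget (function names are hypothetical):

```python
# 1) Stop early when the most-visited root move can no longer be
#    overtaken; 2) prune root moves that can never catch the leader.

def can_stop_early(root_visits, playouts_left):
    """root_visits: visit counts for each root move."""
    ranked = sorted(root_visits, reverse=True)
    best, second = ranked[0], ranked[1]
    # Even if every remaining playout went to the runner-up, it could
    # not catch the current leader: the pick is already final.
    return second + playouts_left < best

def prunable_moves(root_visits, playouts_left):
    best = max(root_visits)
    # A move that cannot reach the leader's count even with all
    # remaining playouts can be dropped from further selection.
    return [i for i, v in enumerate(root_visits)
            if v + playouts_left < best]
```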
You could also implement different search algorithms, such as MCTSnets, if you want to stay in the NN spirit.
If you really want to throw neural nets at the problem of time management too, here is an idea off the top of my head (no idea if it is good, and it also seems quite complex given the likely gain):
From a large number of self-play games at somewhat longer time controls, periodically record which move you would have played if you had stopped the search at various earlier points, and also record the simulation counts for each root move. Come up with some metric for how good each of the moves you would have chosen is. For example: goodness(move) = play the move, perform another somewhat deeper search on the resulting position, and return that deeper search's MCTS value estimate.
Then train a separate, smaller neural net that takes as input the board position, the current number of nodes spent, and, for each move at the root, log(1 + current simulation count for that move), and that tries to predict the quantity: goodness(move chosen by the full search) − goodness(move that would be chosen now).
Ideally you obtain a neural net that can predict the marginal value of searching longer. Hopefully it learns which moves are obvious and in which positions extra thinking time really matters. You could then use this net as part of your time management algorithm to decide when to spend more time.
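A rough sketch of what such a net might look like; all names, sizes, and the board encoding are assumptions for illustration, not part of the proposal:

```python
# Hypothetical time-management net. Inputs: board features, nodes
# spent so far, and log(1 + visit count) per root move (padded to a
# fixed size); target: the marginal value of searching longer.
import torch
import torch.nn as nn

MAX_ROOT_MOVES = 64   # assumed fixed-size padding for root moves
BOARD_FEATURES = 128  # assumed size of some board encoding

class TimeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(BOARD_FEATURES + 1 + MAX_ROOT_MOVES, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # predicted marginal value of more search
        )

    def forward(self, board_feats, nodes_spent, root_visits):
        # root_visits: float tensor of per-move simulation counts
        log_visits = torch.log1p(root_visits)
        x = torch.cat(
            [board_feats, nodes_spent.unsqueeze(-1), log_visits], dim=-1)
        return self.mlp(x)

# Training pairs would come from the self-play logging scheme above:
# (inputs at an early stopping point,
#  goodness(final move) - goodness(move that would be chosen now)).
```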
Wouldn't it be possible to have the NN output an "easy move" flag, 0 or 1? It could be trained on whether stopping at, say, 100 playouts gives the same result as continuing to the full 800. In self-play we would always do the full 800, but in matches we would move quicker whenever the flag is 1 and save time for other moves. Theoretically this could be scaled to more settings than just on and off, but probably only if the initial test is promising.
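A minimal sketch of how the training label could be generated during self-play (the `search` accessor is hypothetical):

```python
# Label 1 ("easy move") when stopping at 100 playouts already picks
# the same move as the full 800-playout search.

EARLY, FULL = 100, 800

def easy_move_label(search):
    early_pick = search.best_move_at(EARLY)  # hypothetical accessor
    full_pick = search.best_move_at(FULL)
    return 1 if early_pick == full_pick else 0

# At match time: if the trained easy-mover head outputs 1, stop at
# EARLY playouts and bank the saved time for harder moves later.
```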
The "how" is nontrivial, the why however stands without further explanation. The simplest method would be to provide the bucket of time and have the program randomly assign time to moves throughout the game from the remaining bucket. The NN will learn quickly that allocating all the time causes a loss (losing on time is part of the rules of chess and we failed to give that rule). Allow it to randomly vary how much time it spends per move until it begins to associate time spent and position with outcomes. Eventually the time allocation will move away from random and some pattern will emerge. No strategy for allocating the time should be provided to start (pure random time play).
I think we could construct some probabilistic model and test the hypothesis that the current top move equals the true top move (the one picked as the number of simulations approaches infinity), somewhat like SPRT. The problem is estimating the true distribution from the observed data, which is far from independent... Maybe someone with a background in statistics has a better idea.
For example, we could test the chance of the top move changing in the next 100 sims by assuming the selection probabilities at the root stay roughly the same.
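One way to estimate that chance under the frozen-probabilities assumption; purely illustrative, not a proper SPRT:

```python
# Assume each future simulation goes to root move i with its current
# empirical selection probability, then estimate how often the leader
# is overtaken within the next `extra_sims` playouts.
import numpy as np

def prob_top_move_changes(visits, extra_sims=100, trials=10000, rng=None):
    rng = rng or np.random.default_rng()
    visits = np.asarray(visits, dtype=float)
    probs = visits / visits.sum()  # frozen selection distribution
    leader = visits.argmax()
    # Sample how the next extra_sims playouts are distributed.
    future = rng.multinomial(extra_sims, probs, size=trials)
    totals = visits + future
    return (totals.argmax(axis=1) != leader).mean()

# e.g. prob_top_move_changes([400, 350, 50]) -> chance the pick flips
```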
The time spent has to be tied to the nature of the position on the board. The NN must make the association between the board position and the relative time allocation versus the time remaining. Anything else is imparting human knowledge to the engine, IMHO. The NN must learn on its own which types of positions warrant longer study; programming in a statistical idea that humans chose because they think it might work for time allocation is again adding our knowledge.

The NN knows the position, and if it randomly tries allocating 1/4 of the remaining time and wins, that should reinforce that behavior for that position type. If it loses, it should consider such a large time allocation bad for that position. I see no issue with the NN discovering for itself that a split second is plenty in a KvKP ending while a 30-piece open position requires relatively more time to increase the win rate.

Deciding for the NN that what matters is the probability that the top pick is actually best is again heading in the wrong direction: we are doing the thinking for it by choosing that metric on its behalf. The rules are that the player (engine) can allocate any amount of time up to the total time remaining for its move, and that if your time reaches zero, you lose. If much more than that is implemented, humans are playing for it. The difference is simply: is this a "Zero" project, or "Zero" only when it is convenient? ;)
@mbabigian I see your point about leaving out domain-specific knowledge, but simple statistical tests like the one I described have nothing to do with chess. It's in the same spirit as auto-resign, trading accuracy for speed. I'd say the input/output plane format and UCB heuristic are much more "non-zero" than hypothesis testing.
@Akababa I hear ya. It is a slippery slope and very easy to start adding traditional chess program algorithms all over the place. If traditional algorithms are going to be added, I'd prefer they add proper UCI support algorithms as I don't see Leela-Chess learning how to behave as a UCI engine through training ;)
I mentioned this with regard to AlphaZero at talkchess as well. Time management is not some separate thing; it has been part of the rules of chess for more than 100 years. The point of a "Zero" project is to bootstrap the AI from nothing but the rules of chess. Failing to articulate the time rules and superimposing human time-management ideas onto the network will at the very least make it weaker, and at worst amounts to outright cheating by adding human knowledge.
Any strong player can glance at a king and pawn ending (two kings, one pawn) and know the game's outcome; analyzing such a position for days will not change it. However, spending time on a complicated middlegame may indeed yield better results. Leela-Chess should learn during its training where to apply more time and where not to. It should be provided a bucket of time (the time control) and should decide, based on its own network, where best to spend that time.
In this regard, AlphaZero was not a "Zero" project either, as the operators forced a human 1 min/move time-management scheme on it during the Stockfish match. Please reconsider adding hand-coded time management algorithms and instead make time allocation a parameter the network must tune itself. I think in the end you'll find the tuned program stronger than one saddled with human programmers' crude time management ideas.
My 2 cents.