kblomdahl / dream-go

Artificial go player based on reinforcement and supervised learning

NNUE (ƎUИИ Efficiently Updatable Neural Network) for Go #56

Open kblomdahl opened 3 years ago

kblomdahl commented 3 years ago

NNUE (ƎUИИ Efficiently Updatable Neural Network) is a sparse, shallow neural network architecture that can be evaluated incrementally. This has proven incredibly successful in Chess and Shogi. Historically, algorithms that work in Chess do not necessarily translate to Go (e.g. alpha-beta pruning), but more modern neural network architectures have generalized well (e.g. MuZero).
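
To make the "incrementally" part concrete, here is a minimal sketch of the core trick (hypothetical feature set and sizes, plain numpy rather than anything from this repository): the first layer is a large sparse affine transform whose output, the accumulator, is patched as features toggle instead of being recomputed from scratch.

```python
import numpy as np

NUM_FEATURES = 2 * 361   # hypothetical feature set: (point, color) on a 19x19 board
HIDDEN = 256             # width of the first (and only expensive) layer

rng = np.random.default_rng(0)
W = rng.standard_normal((NUM_FEATURES, HIDDEN)).astype(np.float32)
b = np.zeros(HIDDEN, dtype=np.float32)

def fresh_accumulator(active_features):
    """Full evaluation: sum the weight rows of every active feature."""
    return b + W[list(active_features)].sum(axis=0)

def update_accumulator(acc, added, removed):
    """Incremental evaluation: cost scales with the number of features that
    changed, not with the number of features that are active."""
    for f in added:
        acc += W[f]
    for f in removed:
        acc -= W[f]
    return acc
```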

This issue is to formally document the thoughts I've had on the possibility of designing an NNUE architecture for Go that performs competitively when compared to larger neural networks.

kblomdahl commented 3 years ago

The chess architecture is based on a denormalization of the features around the king's position (our_king_square, piece_square, piece_type, piece_color), and shogi uses a similar representation. The first challenge is to find a feature representation for Go that fulfills the same criteria, i.e. one that is sparse and allows for fast incremental updates:

(our_piece1, our_piece2, piece_color)

This is the "direct" translation of the chess (KP) representation to Go, which amounts to storing all piece pairs. The problem is that this representation is neither sparse nor does it allow for fast incremental updates: adding a stone to the board would require us to update O(n) cells, since it affects every other stone on the board.
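
A toy illustration of that cost (hypothetical helper, not code from this project): placing one stone toggles a (new_stone, other_stone, color) feature for every stone already on the board, i.e. O(n) accumulator updates per move instead of O(1).

```python
def features_toggled_by_move(stones, new_square, color):
    """stones: list of (square, color) pairs already on the board."""
    return [(new_square, other_square, color) for other_square, _ in stones]

stones = [(72, 0), (128, 1), (200, 0)]
print(len(features_toggled_by_move(stones, 61, 1)))  # 3, one per existing stone
```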

I am also skeptical about this representation, as it is probably not expressive enough for a shallow network to evaluate long-range dependencies such as ladders.

MuZero Embedding representation

The NNUE architecture resembles the MuZero architecture with a few additional constraints. Given this, we could train the representation instead of trying to figure it out ourselves:
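
As a rough sketch of what that pipeline looks like (hypothetical sizes, untrained numpy stand-ins for the real convolutional networks): the representation function maps the board to an embedding, the dynamics function advances the embedding one move at a time, and shared prediction heads read a policy and value off every embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED, MOVES = 128, 362              # hypothetical sizes; 362 = 19*19 + pass

def linear(n_in, n_out):
    return (rng.standard_normal((n_in, n_out)) * 0.01).astype(np.float32)

W_repr = linear(361, EMBED)          # representation: board features -> embedding
W_dyn = linear(EMBED + MOVES, EMBED) # dynamics: (embedding, move) -> next embedding
W_pol = linear(EMBED, MOVES)         # prediction heads, shared across timesteps
W_val = linear(EMBED, 1)

def unroll(board, moves):
    """Evaluate the root, then advance the embedding one move at a time."""
    h = np.maximum(board @ W_repr, 0.0)
    outputs = [(h @ W_pol, np.tanh(h @ W_val))]
    for move in moves:
        h = np.maximum(np.concatenate([h, move]) @ W_dyn, 0.0)
        outputs.append((h @ W_pol, np.tanh(h @ W_val)))
    return outputs

board = rng.standard_normal(361).astype(np.float32)
moves = [np.eye(MOVES, dtype=np.float32)[m] for m in (72, 128)]
preds = unroll(board, moves)  # (policy logits, value) at root and after each move
```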

Experimental results

This approach doesn't seem terrible, but I am not sure whether using such shallow networks yields an actual improvement over just using a single network (as is the case today), since the results are not amazing with a 15x192 representation network and standard prediction heads:

| Architecture | Policy (%) | Value (%) |
|----------------|------------|-----------|
| Baseline 9x128 | 53.2 | 66.2 |
| Dynamics 6x128 | 49.4 | 65.0 |
| Dynamics 6x64 | 47.5 | 65.4 |
| Dynamics 6x16 | 41.6 | 54.9 |

It could also be interesting to compare the baseline 9x128 with a MuZero setup using a 9x128 representation network and a 9x128 dynamics network. This would be equivalent to the current model in terms of runtime, but might yield different gameplay performance. Unfortunately this is not a 1:1 comparison, since I had to fiddle with the hyperparameters to avoid encountering NaN during training:

| Architecture | Policy (%) | Value (%) |
|----------------|------------|-----------|
| Baseline 9x128 | 53.2 | 66.2 |
| Mu 9x128 | ... | ... |

It is generally accepted in the computer Go community right now that adding more evaluations of worse quality doesn't pay off compared to just having a stronger network in the first place, so this might not work unless we find a way to address the accuracy problem.

...

kblomdahl commented 3 years ago

All search algorithms need to weigh the benefits of quality vs. quantity of nodes explored. NNUE is a step towards quantity, whereas all recent Go algorithms trend towards quality, with increasingly deep and wide neural networks.

One of the reasons for this divide can be observed in the number of candidate moves per game state: Chess has about 35, whereas Go has about 150. This means the cost of a misstep during search is a lot higher in Go than in Chess, which might explain why quality is more important in Go than in Chess. This would suggest that NNUE might not work as well in Go as it does in Chess.

Another reason might be the length of an average game. In Chess the average game is about 38 moves long [1], while in Go it is closer to 150. This means a Go engine has to probe much deeper than a Chess engine if it wants to read the same percentage of the game ahead. This would suggest that NNUE might not work as well in Go as it does in Chess.
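
A back-of-envelope calculation combining the two numbers quoted above (assuming full-width search with a fixed node budget, which real engines do not actually do, so this is only illustrative):

```python
import math

nodes = 10**6                          # same fixed node budget for both games
for game, branching, length in [("Chess", 35, 38), ("Go", 150, 150)]:
    depth = math.log(nodes) / math.log(branching)  # full-width depth reachable
    print(f"{game}: ~{depth:.1f} moves deep, ~{100 * depth / length:.0f}% of a game")
```

With a million nodes this reaches roughly 10% of an average Chess game, but less than 2% of an average Go game.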

[1] https://chess.stackexchange.com/questions/2506/what-is-the-average-length-of-a-game-of-chess

kblomdahl commented 3 years ago

All current NNUE architectures are pure value networks, which output 0 or 1 depending on whether the current state is a losing state or not. This is not necessarily a problem; generalizing NNUE to output both a policy and a value should not be impossible.
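
A sketch of what such a generalization could look like (hypothetical sizes and weights; the point is only that both heads can share the incrementally updated accumulator as their trunk):

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, MOVES = 256, 362             # hypothetical sizes; 362 = 19*19 + pass

W_val = (rng.standard_normal((HIDDEN, 1)) * 0.01).astype(np.float32)
W_pol = (rng.standard_normal((HIDDEN, MOVES)) * 0.01).astype(np.float32)

def heads(accumulator):
    """Shared trunk: the incrementally maintained first-layer accumulator."""
    h = np.clip(accumulator, 0.0, 1.0)   # clipped relu, as in the chess nets
    value = np.tanh(h @ W_val)           # win/loss estimate
    logits = h @ W_pol                   # move priors for the search to use
    return value, logits
```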

kblomdahl commented 2 years ago

Continuing on the topic of the MuZero embedding representation: it occurred to us that you can view the dynamics function as an RNN (or one of its more successful variants, the GRU or LSTM). The advantage of this would be that RNNs are very well researched and in general use only feedforward layers, which are fast. This gives us the following procedure (a small sketch follows the list):

  1. Use the representation function to squash the input features to an embedding vector.
  2. Evaluate and predict the next move from the embedding vector.
  3. Set the embedding vector as the initial hidden state of the RNN.
  4. Set the input features for the next few moves as the inputs of the RNN.
  5. Use the outputs from the RNN for each timestep to evaluate and predict the next move.
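
A minimal numpy sketch of this procedure (hypothetical sizes; the random initial hidden state stands in for the output of the representation function, and the GRU cell plays the role of the dynamics function):

```python
import numpy as np

EMBED, IN = 128, 362                       # hypothetical embedding / input sizes
rng = np.random.default_rng(0)
p = lambda *s: (rng.standard_normal(s) * 0.01).astype(np.float32)

Wz, Uz = p(IN, EMBED), p(EMBED, EMBED)     # update gate
Wr, Ur = p(IN, EMBED), p(EMBED, EMBED)     # reset gate
Wh, Uh = p(IN, EMBED), p(EMBED, EMBED)     # candidate state

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x):
    """One dynamics step: advance the embedding by one move's input features."""
    z = sigmoid(x @ Wz + h @ Uz)
    r = sigmoid(x @ Wr + h @ Ur)
    return (1.0 - z) * h + z * np.tanh(x @ Wh + (r * h) @ Uh)

h = rng.standard_normal(EMBED).astype(np.float32)  # stand-in for the output of
                                                   # the representation function
for _ in range(4):                                 # next few moves during search
    x = np.zeros(IN, dtype=np.float32)
    x[rng.integers(IN)] = 1.0                      # one-hot move features
    h = gru_step(h, x)                             # cheap feedforward update
```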

The idea would be to have a potentially large initial representation function to compute the initial hidden state, and then allow the RNN to operate on the much smaller embedding (which would allow it to be very fast). A few open questions in regard to the modelling:

Both the GRU and LSTM make heavy use of tanh instead of relu; do we need to change the final activation of the representation function to match this?

This doesn't seem to make any significant difference; the difference in accuracy is within the margin of error.

What size should the embedding be?

Experiments with 362, 722, and 2048 show clear diminishing returns, but 2048 is better than 722, which is better than 362. This needs further investigation to establish the scaling factor.

kblomdahl commented 2 years ago

Experimental results

Experimental results of using a large representation network with a small dynamics network are so far negative: the model is bottlenecked by the dynamics network, and its accuracy scales with the minimum of the dynamics and representation network sizes. So there is no point in having two networks of different sizes.

If a dynamics network that is fed the exact game state cannot scale, then I do not think there is any reason to expect a dynamics network that is fed only the motion features to keep up, so this is probably a dead end in terms of research.

See https://github.com/Chicoryn/dream-go/pull/58#issuecomment-1040451748 for the full data from the experiment.