dshawul opened this issue 4 years ago
Thanks for the interest. Good questions.
It seems that in the new version of the KataGo paper, all the details about board masking to handle different board sizes in the same mini-batch have been removed.
It was removed from the main paper in the recent versions just due to space limitations and because it wasn't a focus of the research. The method has remained unchanged, and the actual architectural details are still present and precisely described in the appendix.
I was thinking of using the same board size within one mini-batch but alternating between different board sizes from mini-batch to mini-batch. This is because, for one, it avoids the use of masks. Keras, for instance, accepts inputs of (None, None, C) instead of (B, B, C) and handles fully-convolutional network training on variable board sizes.
You could certainly make the board sizes vary by batch instead of by sample. The tradeoffs I can think of would be:
Is the global-pooling structure somewhat similar to squeeze-excitation nets? Do you think the SE block needs to be modified similarly when training with multiple board sizes?
Yes, SE would also need to be modified.
Also, I don't understand why, out of the 3C channels the global-pooling structure produces for the value head, 2C channels are board-size-scaled versions of the C channels. Since a dense layer follows, the NN should be able to learn to scale appropriately with the board size.
Think about this. Suppose the pooling only reports the mean activation of each channel. Suppose that for some of the channels actually the more useful feature that future layers "want" is the mean activation of these channels scaled by board size, rather than just plain mean.
Now, what is the dense layer supposed to do? What fixed linear combination of means-of-channels gives you mean-scaled-by-board-size? (Let's also suppose that the activation strengths of the channels, which were computed via a few convolutions in earlier layers, don't vary much with board size in the first place - the entire point of pooling at all is because convolutions are bad at this even if they can do it somewhat. Perhaps even this is early enough in the net that the perceptual radius is literally too small).
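To make that concrete, here is a tiny numpy sketch of the issue; the numbers and the `(size - 14) * 0.1` normalization are purely illustrative, not necessarily the exact constants KataGo uses:

```python
import numpy as np

# Hypothetical pooled means for one channel on three board sizes, assumed
# roughly size-invariant (that's the premise above).
means = np.array([0.5, 0.5, 0.5])
sizes = np.array([9.0, 13.0, 19.0])

# What later layers "want": the mean scaled by board size.
target = means * sizes                        # [4.5, 6.5, 9.5]

# A dense layer can only apply one fixed weight w to the mean, independent
# of size: w * 0.5 can match at most one of 4.5 / 6.5 / 9.5, never all three.

# If the pooling itself also emits a board-size-scaled copy of the mean,
# the target becomes a trivial fixed linear combination of the two channels:
scaled = means * (sizes - 14.0) * 0.1
recovered = 14.0 * means + 10.0 * scaled      # == means * sizes
assert np.allclose(recovered, target)
```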
Also, wouldn't it be easier to just provide an input plane or single binary feature indicating the relative size of the board?
You could do this. Mechanically then the way the net would be able to scale different values internally by board size would be to break up board size piecewise into a bunch of fragments using ReLU, take the to-be-scaled-value and split it across multiple channels with different weights based on which board size fragments are active, and then recombine with different weights. (Recall that in a convolutional net, there are no product terms between inputs. There are only products between inputs and fixed weights, so any scaling of inputs by each other must be done by defining piecewise linear chunks through multiple channels).
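As a toy illustration of that piecewise-linear trick (assuming a nonnegative activation bounded by 1 and only a handful of integer board sizes; this is just a sketch of the mechanism, not anything from KataGo's code):

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

SIZES = [9, 13, 19]  # the discrete board sizes we want to support

def scale_by_size_piecewise(x, s):
    """Approximate x * s for x in [0, 1] using only linear combinations and
    ReLU -- no products between inputs, mirroring what a conv net can do."""
    y = 0.0
    for s_i in SIZES:
        # A gate that is 1 when s == s_i and 0 for any other integer size.
        gate = relu(1.0 - (relu(s - s_i) + relu(s_i - s)))
        # With x in [0, 1], ReLU(x + gate - 1) passes x through only when
        # the gate is on: one extra channel per supported size.
        y += s_i * relu(x + gate - 1.0)
    return y

for s in SIZES:
    for x in (0.0, 0.3, 1.0):
        assert np.isclose(scale_by_size_piecewise(x, s), x * s)
```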
Intuitively I'd guess scaling the pooling directly to be better than forcing the net to do that. Besides, if the neural net actually did also "want" board size as a numeric input feature, after merely one such global pooling layer it can trivially compute that on its own and then use it as desired in all subsequent layers, so the scaled pooling should almost just dominate an input feature.
Another question is that I used global-max-pooling in the value head by mistake and noticed the net has difficulty learning. Did you observe similar behavior?
No, I never even tested such a thing in the first place because I had a strong intuition from first principles that this would be a bad idea.
To first order, to find the chance that you win, given some uncertainty in one's evaluation, you need to do something like `expected score / sqrt(variance of score)`.
The units could be arbitrary rather than simply "points", with hacks for discreteness of situations and correlations between fights, etc. Still, this is the natural first-order thing. Expected score is the sum over the board of the score from each part of the board. Variance of score is going to look like a sum over the board of the variance contribution from each part of the board, due to how "uncertain" you think your evaluation of that part of the board is. The neural net may of course choose to do some of this scaling before pooling, and will need lots of adjustments to deal with correlations and other complexities.
But either way, you want to sum across the board, not max across the board.
The same thing is even more obviously true for predicting the score itself - you want to sum the territory across the board, not max it.
Sums can also be computed by scaling a mean pooling, so mean pooling is also okay as the "basic" operation. But: the same total points advantage on a large board translates to a smaller advantage in winning chance, because more situations means that even holding constant the uncertainty of each individual situation, the total uncertainty will be larger. Therefore winrate and score estimation will want different scalings by board size. So in addition to max alone simply being bad, there should also be a benefit to directly providing different board-size scalings of the pooling in the value head.
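A toy numerical version of that argument (the normal-CDF link and the per-point variance are illustrative assumptions, not anything KataGo specifically does):

```python
import math

def winrate(expected_score, score_variance):
    # First-order model: P(win) ~ Phi(E[score] / sqrt(Var[score])).
    z = expected_score / math.sqrt(score_variance)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The same 3-point lead with the same per-point uncertainty on 9x9 vs 19x19.
# Variance sums over board points, so the identical lead is worth less
# winrate on the bigger board -- hence different board-size scalings for
# the winrate and score outputs.
per_point_var = 0.25  # illustrative only
print(winrate(3.0, per_point_var * 9 * 9))    # ~0.75
print(winrate(3.0, per_point_var * 19 * 19))  # ~0.62
```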
Thanks for the thorough explanation, much appreciated!
Now, what is the dense layer supposed to do? What fixed linear combination of means-of-channels gives you mean-scaled-by-board-size?
Ok thanks, I mixed up my thoughts there. Maybe one would only need Average pooling if there were a relative board-size input plane?
No, I never even tested such a thing in the first place because I had a strong intuition from first principles that this would be a bad idea.
I meant using your global-pooling layer that outputs 3C channels, with (Average, scaled-average, Max) pooling for the value head (I know you replaced the Max with quadratically scaled Avg pooling, but I used Max by mistake there too). The Max pooling seems to screw up learning a lot despite the presence of Average pooling. With Max pooling alone I think it would be terrible, as you said. With just average pooling, training seems to be going well, although I am not sure how efficient global-pooling + fully-connected in the value head is compared to the regular flatten + fully-connected.
You could do this. Mechanically then the way the net would be able to scale different values internally by board size would be to break up board size piecewise into a bunch of fragments using ReLU, take the to-be-scaled-value and split it across multiple channels with different weights based on which board size fragments are active, and then recombine with different weights.
Thanks for this insight, it sure looks more efficient to directly provide the value head a board-size-scaled output from the global-pooling layer.
I would like to share with you a unique problem I had with making chess networks board-size-independent, maybe you have some ideas too. In chess the moves are of from-to type, so initially I thought about outputting two planes to represent the probabilities of the from- and to-squares separately. Then these have to be combined "somehow" to produce a probability for a move. This will not be a good representation, however, since weights for different moves sharing the same to/from squares will affect one another. Finally, I came up with a representation that exploits a property of slider moves, namely that only one slider move can come from a given direction to a given to-square. So we have 8 direction channels for Q/R/B moves and 8 for knights, for a total of 16 channels for a policy of size 16x8x8. A0 took the from-square into consideration as well and needs a bigger policy output of 73x8x8, which I am not sure is any better than 16x8x8. However, this raises a question when you have a game with a rule that two or more slider moves can come from the same direction (e.g. if a Queen were allowed to jump over a Bishop). I don't know how to make such a game board-size-independent ...
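For reference, a sketch of one way to index such a 16x8x8 policy; the direction/offset ordering and coordinate convention here are arbitrary illustrative choices:

```python
# Channel = arrival direction for ray moves (8) or knight offset (8);
# spatial position = the to-square. Promotions/castling would need
# extra handling.
RAY_DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]            # channels 0..7
KNIGHT_OFFSETS = [(-2, -1), (-2, 1), (-1, -2), (-1, 2),
                  (1, -2), (1, 2), (2, -1), (2, 1)]      # channels 8..15

def policy_index(from_sq, to_sq):
    """Map a (from, to) move to (channel, to_rank, to_file) in a 16x8x8 head."""
    (fr, ff), (tr, tf) = from_sq, to_sq
    dr, df = tr - fr, tf - ff
    if (dr, df) in KNIGHT_OFFSETS:
        return 8 + KNIGHT_OFFSETS.index((dr, df)), tr, tf
    # Ray move (slider, or a king/pawn step, which blocks any slider behind
    # it on the same ray): reduce the displacement to a unit direction.
    # Only one legal move can arrive at to_sq from that direction.
    step = (max(-1, min(1, dr)), max(-1, min(1, df)))
    return RAY_DIRS.index(step), tr, tf

# A rook move e1->e8 and a bishop move a4->e8 target the same square but
# land in different channels:
print(policy_index((0, 4), (7, 4)))  # (6, 7, 4): arriving from the south
print(policy_index((3, 0), (7, 4)))  # (7, 7, 4): arriving from the south-west
```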
I meant using your global-pooling layer that outputs 3C channels, with (Average, scaled-average, Max) pooling for the value head (I know you replaced the Max with quadratically scaled Avg pooling, but I used Max by mistake there too). The Max pooling seems to screw up learning a lot despite the presence of Average pooling. With Max pooling alone I think it would be terrible, as you said. With just average pooling, training seems to be going well, although I am not sure how efficient global-pooling + fully-connected in the value head is compared to the regular flatten + fully-connected.
Hmm, that's interesting. I'm curious - do you also get better results from dropping the max pooling part everywhere? Perhaps generally max is hard to train (it certainly does do funny things if you think about the gradients).
I would like to share with you a unique problem I had with making chess networks board-size-independent, maybe you have some ideas too. In chess the moves are of from-to type, so initially I thought about outputting two planes to represent the probabilities of the from- and to-squares separately. Then these have to be combined "somehow" to produce a probability for a move. This will not be a good representation, however, since weights for different moves sharing the same to/from squares will affect one another. Finally, I came up with a representation that exploits a property of slider moves, namely that only one slider move can come from a given direction to a given to-square. So we have 8 direction channels for Q/R/B moves and 8 for knights, for a total of 16 channels for a policy of size 16x8x8. A0 took the from-square into consideration as well and needs a bigger policy output of 73x8x8, which I am not sure is any better than 16x8x8. However, this raises a question when you have a game with a rule that two or more slider moves can come from the same direction (e.g. if a Queen were allowed to jump over a Bishop). I don't know how to make such a game board-size-independent ...
In that case why not have a separate sliders output for each different piece type that can slide?
Unfortunately, I have not used the global-pooling layer of KataGo inside the residual blocks; I use regular SE there, which does only average pooling.
I guess separate outputs for each different piece type will work as long as we ignore a Queen moving over another Queen. Somehow I thought there would be a general approach to board-size-independent neural networks for any game, but I guess one needs to make simplifications like that.
Thanks!
No prob. By the way, you might consider testing adding a board-size-scaled set of channels within your SE as well and seeing if it helps? I would be interested to know if it improves handling of a wider range of sizes. Or if in fact it doesn't do anything. :)
Nothing about the scaling part of the unit would change; I think you'd just have 2x as many channels in the first layer of the little 2-layer mini-net that computes the sigmoid activations to scale by, one set of channels being board-size-scaled.
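If you want to try that, here's a rough tf.keras sketch of such an SE unit; the `(size - 14) * 0.1` normalization and the reduction factor are illustrative assumptions, and it assumes square NHWC inputs with a single board size per batch:

```python
import tensorflow as tf

class SizeScaledSE(tf.keras.layers.Layer):
    """A sketch of an SE unit whose squeeze descriptor also carries a
    board-size-scaled copy of the channel means."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = tf.keras.layers.Dense(channels // reduction, activation="relu")
        self.fc2 = tf.keras.layers.Dense(channels, activation="sigmoid")

    def call(self, x):                                         # x: [N, H, W, C]
        pooled = tf.reduce_mean(x, axis=[1, 2])                # [N, C]
        size = tf.cast(tf.shape(x)[1], x.dtype)                # board width
        scaled = pooled * (size - 14.0) * 0.1                  # [N, C]
        descriptor = tf.concat([pooled, scaled], axis=-1)      # [N, 2C]
        gates = self.fc2(self.fc1(descriptor))                 # [N, C]
        return x * gates[:, None, None, :]                     # excite
```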
Could you compile a win32 version of KataGo for big boards? @lightvector
Oh, just noticed this now. @l1t1 - are there really people who are still on win32 and not win64? I was actually considering at some point stopping compiling that version for the normal releases too.
This is not an issue but a question about how katago handles different board sizes. Please feel free to move it or direct me to where to post the question if it can't stay here.
It seems that in the new version of the KataGo paper, all the details about board masking to handle different board sizes in the same mini-batch have been removed. I was thinking of using the same board size within one mini-batch but alternating between different board sizes from mini-batch to mini-batch. This is because, for one, it avoids the use of masks. Keras, for instance, accepts inputs of (None, None, C) instead of (B, B, C) and handles fully-convolutional network training on variable board sizes.
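For concreteness, a minimal tf.keras sketch of that kind of size-agnostic fully-convolutional setup (the channel counts and the tiny two-conv trunk are placeholders, not KataGo's architecture):

```python
import tensorflow as tf

# Spatial dims are left as None, so the same weights train on 9x9, 13x13,
# 19x19, ... as long as each mini-batch sticks to a single board size.
C = 22          # number of input feature planes (illustrative)
FILTERS = 96    # trunk width (illustrative)

inputs = tf.keras.Input(shape=(None, None, C))
x = tf.keras.layers.Conv2D(FILTERS, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.Conv2D(FILTERS, 3, padding="same", activation="relu")(x)
policy = tf.keras.layers.Conv2D(1, 1, padding="same")(x)   # one logit per point
model = tf.keras.Model(inputs, policy)

# Each call sees a batch of a single size; sizes can differ between calls.
print(model(tf.zeros([8, 9, 9, C])).shape)    # (8, 9, 9, 1)
print(model(tf.zeros([8, 19, 19, C])).shape)  # (8, 19, 19, 1)
```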
Is the global-pooling structure somewhat similar to squeeze-excitation nets? Do you think the SE block needs to be modified similarly when training with multiple board sizes? Also, I don't understand why, out of the 3C channels the global-pooling structure produces for the value head, 2C channels are board-size-scaled versions of the C channels. Since a dense layer follows, the NN should be able to learn to scale appropriately with the board size. Also, wouldn't it be easier to just provide an input plane or single binary feature indicating the relative size of the board? Another question is that I used global-max-pooling in the value head by mistake and noticed the net has difficulty learning. Did you observe similar behavior?
Thanks for the great work!