lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

question about scoreStdev and training #200

Closed: isseebx123 closed this issue 4 years ago

isseebx123 commented 4 years ago

First of all, English is not my first language so please excuse any mistakes.

I'm doing a lot of research with reference to your papers and code, and I'd like to get some advice regarding score maximization and training. I'm not using the same code as KataGo, so please let me know if there's something I'm missing.

First, it seems that the scoreMean and scoreMeanSq values are used for score maximization. In the latest version, scoreMeanSq is computed as scoreMean * scoreMean + scoreStdev * scoreStdev, so I am wondering why scoreMeanSq is used at all. Can't we just use scoreStdev in getScoreUtility?

scoreMeanSq = scoreMean * scoreMean + scoreStdev * scoreStdev;

Second, there is a problem where the output value of score_stddev keeps growing during training. In my first Xavier-randomly-initialized model, scoreMean is -1.46 and scoreStdev is 14 on the empty board. According to the formula above, scoreMeanSq is already around 200. However, as training proceeds, scoreStdev gets bigger, so scoreMeanSq exceeds 10000. In the expectedWhiteScoreValue function, once scoreMeanSq grows beyond a certain amount, it always returns the same value. I think this is also a reason the net is not training. Do you have any opinions on this? Or is there any technique that lets scoreStdev train reliably?

Lastly, not only the scoreStdev loss but also the other score-related losses, such as the pdf and cdf losses, tend to increase again after dropping at the beginning of training. I wonder if you have had this kind of experience. I have not yet applied SWA, which appears to be used in KataGo. I also excluded the moving average loss because training seemed to get worse with it. Will these two have a big impact on training? Any comments would be appreciated.

lightvector commented 4 years ago

I'm doing a lot of research with reference to your papers and code, and I'd like to get some advice regarding score maximization and training. I'm not using the same code as KataGo, so please let me know if there's something I'm missing.

First, it seems that the scoreMean and scoreMeanSq values are used for score maximization. In the latest version, scoreMeanSq is computed as scoreMean * scoreMean + scoreStdev * scoreStdev, so I am wondering why scoreMeanSq is used at all. Can't we just use scoreStdev in getScoreUtility?

We could, but the point of scoreMeanSq is so that we are separately also able to accumulate the second moment of the score in the search tree as an interesting stat. In general, given two random distributions X and Y, if we let Z be the mixture distribution that is X with probability p and Y with probability 1-p, then:

E[Z] = p E[X] + (1-p) E[Y] ("first moment, i.e. mean")
E[Z^2] = p E[X^2] + (1-p) E[Y^2] ("second raw moment")

whereas neither of the following is true:

Stdev[Z] = p Stdev[X] + (1-p) Stdev[Y]
Var[Z] = p Var[X] + (1-p) Var[Y]

So it's mathematically super-convenient for summing across the search tree to just use the second raw moment (there is also a cost of some numeric precision here but that's no big deal in practice). But this has absolutely nothing to do with training. The neural net reports stdev, not the second raw moment, and the neural net never ever sees the second raw moment or anything like it - it's just a mathematical trick.
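
To make that concrete, here is a toy Python sketch of the bookkeeping (not KataGo's actual C++ search code; the weights and numbers are invented for the example):

```python
import math

# Hypothetical per-child stats accumulated during search: (weight, scoreMean, scoreMeanSq).
# Weights would normally be the children's visit fractions and sum to 1.
children = [
    (0.5,  3.0,  25.0),   # E[X] = 3,  E[X^2] = 25
    (0.3, -8.0,  70.0),   # E[Y] = -8, E[Y^2] = 70
    (0.2, 12.0, 150.0),
]

# Raw moments of the mixture are plain weighted sums of the children's raw moments.
mix_mean    = sum(w * m   for w, m, _   in children)
mix_mean_sq = sum(w * msq for w, _, msq in children)

# The stdev is only derived at the end, when it is actually wanted:
mix_stdev = math.sqrt(max(0.0, mix_mean_sq - mix_mean * mix_mean))
print(mix_mean, mix_mean_sq, mix_stdev)
```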

Second, there is a problem where the output value of score_stddev keeps growing during training. In my first Xavier-randomly-initialized model, scoreMean is -1.46 and scoreStdev is 14 on the empty board. According to the formula above, scoreMeanSq is already around 200. However, as training proceeds, scoreStdev gets bigger, so scoreMeanSq exceeds 10000.

Is the standard deviation of the score of your self-play game results as large as 100 points? With random play, a game can easily be won or lost with one player or the other controlling almost the whole board, just by chance. So if you are starting out learning from random play, a standard deviation in the 100s of points is correct, and you should be prepared for this by using an activation function capable of reporting such values, and by setting the scaling of that output so that producing such a value does not require an absurd internal activation strength within the neural net.
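
For example, here is a rough TensorFlow sketch of the kind of scaled output I mean (not the exact code in KataGo's model.py; the tensor names are placeholders):

```python
import numpy as np
import tensorflow as tf

# How large does the pre-activation need to be to report a stdev of 150 points?
# With softplus(x) * 20, softplus(x) only needs to reach 7.5, so x is about 7.5.
# With a bare softplus(x), x itself would have to reach about 150.
target_stdev = 150.0
x_scaled   = np.log(np.expm1(target_stdev / 20.0))  # inverse softplus, ~7.5
x_unscaled = np.log(np.expm1(target_stdev))         # ~150.0
print(x_scaled, x_unscaled)

# The corresponding output head (tensor name is a placeholder):
raw_head = tf.random.normal([8, 1])
score_stdev_prediction = tf.math.softplus(raw_head) * 20.0
```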

In the expectedWhiteScoreValue function, once scoreMeanSq grows beyond a certain amount, it always returns the same value. I think this is also a reason the net is not training. Do you have any opinions on this? Or is there any technique that lets scoreStdev train reliably?

I still don't understand why you are talking about scoreMeanSq in the context of the neural net. The neural net should never be dealing with this value; as I mentioned before, it's purely a mathematical trick for the search tree, so it doesn't make sense to bring it up when thinking about how the neural net learns. What does "always returns the same value" mean?

Lastly, not only the scoreStdev loss but also the other score-related losses, such as the pdf and cdf losses, tend to increase again after dropping at the beginning of training. I wonder if you have had this kind of experience. I have not yet applied SWA, which appears to be used in KataGo. I also excluded the moving average loss because training seemed to get worse with it. Will these two have a big impact on training? Any comments would be appreciated.

I'm not sure what the moving average loss you are talking about is - could you clarify? SWA of course should not have any effect on the losses you see; SWA literally does not affect the training in any way. I'm not exactly sure what would cause the loss pattern you observe, and unfortunately I don't know whether the loss in my own training run follows the same pattern either, since I no longer pay much attention to the beginning of training in KataGo - it proceeds so fast that even on 19x19, the bot is strong human amateur dan within the first few days without many GPUs (and in much less than 1 day with the normal cluster of GPUs that Kata has been using).
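
To illustrate why SWA can't affect the losses you log, here is a toy sketch (not KataGo's actual implementation): the averaged weights live entirely off to the side of the optimizer.

```python
import numpy as np

def train_step(weights):
    # Stand-in for a real optimizer update.
    return weights - 0.01 * np.random.randn(*weights.shape)

weights = np.zeros(10)
swa_weights = np.zeros_like(weights)
n_snapshots = 0

for step in range(1, 1001):
    weights = train_step(weights)   # the optimization itself never sees swa_weights
    if step % 100 == 0:             # periodically fold a snapshot into the running average
        n_snapshots += 1
        swa_weights += (weights - swa_weights) / n_snapshots

# swa_weights is only used when exporting/evaluating a model, never fed back into
# the optimizer, so it cannot change the training losses you log.
```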

(edit: fixed typos in math) (edit: a few more clarifications)

lightvector commented 4 years ago

There were certainly parts of your message that weren't clear to me. But let me know if that helps, or if I misunderstood anything you were saying; I'm happy to explain further. :)

lightvector commented 4 years ago

Maybe a further question if you're seeing training instability at the start of your run - have you tried adjusting the learning rate and/or looking at the magnitude of gradients and using gradient clipping?

Also, are there major architectural differences with your net? For example, I found that once I switched KataGo to no longer use batch norm, gradient clipping was needed to prevent instability, but other than that it learned quite fine.
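
In case it's useful, global-norm gradient clipping in TensorFlow looks roughly like this (a generic sketch, not KataGo's training loop; the model, optimizer, and loss function here are placeholders):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.02, momentum=0.9)

def train_step(model, inputs, targets, loss_fn, clip_norm=5.0):
    with tf.GradientTape() as tape:
        loss = loss_fn(targets, model(inputs, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Rescale the whole gradient vector if its global norm exceeds clip_norm.
    grads, global_norm = tf.clip_by_global_norm(grads, clip_norm)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss, global_norm   # logging global_norm helps pick a sensible clip_norm
```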

isseebx123 commented 4 years ago

First of all, thanks for the quick and friendly answer. Based on what you said, I spent some time running experiments, which delayed my reply. Sorry for the late response.

We could, but the point of scoreMeanSq is so that we are separately also able to accumulate the second moment of the score in the search tree as an interesting stat. In general, given two random distributions X and Y, if we let Z be the mixture distribution that is X with probability p and Y with probability 1-p, then:

E[Z] = p E[X] + (1-p) E[Y] ("first moment, i.e. mean")
E[Z^2] = p E[X^2] + (1-p) E[Y^2] ("second raw moment")

whereas neither of the following is true:

Stdev[Z] = p Stdev[X] + (1-p) Stdev[Y]
Var[Z] = p Var[X] + (1-p) Var[Y]

So it's mathematically super-convenient for summing across the search tree to just use the second raw moment (there is also a cost of some numeric precision here but that's no big deal in practice). But this has absolutely nothing to do with training. The neural net reports stdev, not the second raw moment, and the neural net never ever sees the second raw moment or anything like it - it's just a mathematical trick.

My mathematical knowledge is weak, but I will try to understand it. The reason I was initially fixated on scoreMeanSq was that I thought this value was used as-is when calculating the score utility. However, I confirmed that scoreMeanSq is converted to a stdev via the getScoreStdev function, so the question I had is mostly resolved.

Is the standard deviation of the score of your self-play game results as large as 100 points? With random play, a game can easily be won or lost with one player or the other controlling almost the whole board, just by chance. So if you are starting out learning from random play, a standard deviation in the 100s of points is correct, and you should be prepared for this by using an activation function capable of reporting such values, and by setting the scaling of that output so that producing such a value does not require an absurd internal activation strength within the neural net.

Yes. Since training starts from random play, self-play games with large standard deviations are being generated. I will report the actual value later. Is it correct to understand that the scaling you mention is, for example, adjusting the factor of 20 multiplied in the line below?

scorestdev_prediction = tf.math.softplus(miscvalues_output[:,1]) * 20.0

I still don't understand why you are talking about scoreMeanSq in the context of the neural net. The neural net should never be dealing with this value; as I mentioned before, it's purely a mathematical trick for the search tree, so it doesn't make sense to bring it up when thinking about how the neural net learns. What does "always returns the same value" mean?

Oh, this was because of a bug. As I said above, I was using the large scoreMeanSq value directly, without passing it through getScoreStdev, to compute the score utility, which meant that the same scoreUtility always came out for the same scoreMean. Or maybe my English was not good enough and the meaning got lost. Sorry :)

I'm not sure what the moving average loss you are talking about is - could you clarify? SWA of course should not have any effect on the losses you see; SWA literally does not affect the training in any way. I'm not exactly sure what would cause the loss pattern you observe, and unfortunately I don't know whether the loss in my own training run follows the same pattern either, since I no longer pay much attention to the beginning of training in KataGo - it proceeds so fast that even on 19x19, the bot is strong human amateur dan within the first few days without many GPUs (and in much less than 1 day with the normal cluster of GPUs that Kata has been using).

What I called the moving average loss is the moving average in the code below. I am not very familiar with this. Could you please explain how it works? Or is there a related paper or explanation somewhere?

Maybe a further question if you're seeing training instability at the start of your run - have you tried adjusting the learning rate and/or looking at the magnitude of gradients and using gradient clipping? Also, are there major architectural differences with your net? For example, I found that once I switched KataGo to no longer use batch norm, gradient clipping was needed to prevent instability, but other than that it learned quite fine.

The net I am experimenting with is different from KataGo; only the value, misc, and score head parts are the same. The learning rate has not been tuned much due to a lack of computing resources. I always watch the gradient values in TensorBoard, but I have never used gradient clipping. I will try it later. Thank you.

I have some additional questions. Why do you scale by scale_initial_weights when initializing weights in model.py? Is it to prevent problems from the L2 loss acting strongly early in training? Second, I know that the older papers and older code do not reuse the search tree in self-play; is that still true? One last question: is there a good way to check whether the score training is going well? I'm simply taking several games and measuring the Huber loss, roughly as in the sketch below.
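
Concretely, what I mean by measuring the Huber loss is roughly this (a toy sketch; the score arrays are just placeholders):

```python
import numpy as np

def huber(pred, target, delta=10.0):
    # Quadratic near zero, linear for large errors, so a few blowout games
    # do not dominate the average.
    err = np.abs(pred - target)
    return np.where(err <= delta, 0.5 * err ** 2, delta * (err - 0.5 * delta))

# Predicted final scores vs. actual final scores for a handful of positions
# (the numbers here are placeholders).
predicted_scores = np.array([3.5, -12.0, 40.0])
final_scores = np.array([6.5, -20.5, 15.0])
print(huber(predicted_scores, final_scores).mean())
```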

The story got longer. Thanks again for your kind answer.

(edit: fixed additional questions)

lightvector commented 4 years ago

Just realizing I never replied back here. Some brief answers to some of your questions:

isseebx123 commented 4 years ago

Thank you for your kind answer. It was very helpful. After experimenting with the advice you have given, I will ask for help again later. :)