lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

question about scoreStdev and training #200

Closed: isseebx123 closed this issue 4 years ago

isseebx123 commented 4 years ago

First of all, English is not my first language so please excuse any mistakes.

I'm doing a lot of research with reference to your papers and code, and I'd like to get some advice regarding score maximization and training. I'm not using the same code as KataGo, so please let me know if there's something I'm missing.

First, it seems that the scoreMean and scoreMeanSq values are used for score maximization. In the latest version, scoreMeanSq is computed as scoreMean * scoreMean + scoreStdev * scoreStdev, so I am wondering why scoreMeanSq is used at all. Can't we just use scoreStdev in getScoreUtility?

scoreMeanSq = scoreMean * scoreMean + scoreStdev * scoreStdev;

Second, there is a problem where the output value of score_stddev keeps growing during training. In my first Xavier-randomly-initialized model, scoreMean is -1.46 and scoreStdev is 14 on the empty board. According to the formula above, scoreMeanSq is already around 200. However, as training proceeds, scoreStdev gets bigger, so scoreMeanSq exceeds 10000. In the expectedWhiteScoreValue function, once scoreMeanSq grows beyond a certain amount, it always returns the same value. I think this is also a reason the net is not training. Do you have any opinions on this? Or is there any technique that lets scoreStdev train reliably?

Lastly, not only the scoreStdev loss but also the other score-related losses, such as the pdf and cdf losses, tend to increase again after dropping at the beginning of training. I wonder if you have had this kind of experience. I have not yet applied SWA, which appears to be used in KataGo. I also excluded the moving average loss because training seemed to get worse with it. Will these two have a big impact on training? Any comments would be appreciated.

lightvector commented 4 years ago

I'm doing a lot of research with reference to your papers and code, and I'd like to get some advice regarding score maximization and training. I'm not using the same code as KataGo, so please let me know if there's something I'm missing.

First, it seems that the scoreMean and scoreMeanSq values are used for score maximization. In the latest version, scoreMeanSq is computed as scoreMean * scoreMean + scoreStdev * scoreStdev, so I am wondering why scoreMeanSq is used at all. Can't we just use scoreStdev in getScoreUtility?

We could, but the point of scoreMeanSq is so that we are separately also able to accumulate the second moment of the score in the search tree as an interesting stat. In general, given two random distributions X and Y, if we let Z be the mixture distribution that is X with probability p and Y with probability 1-p, then:

E[Z] = p E[X] + (1-p) E[Y] ("first moment, i.e. mean")
E[Z^2] = p E[X^2] + (1-p) E[Y^2] ("second raw moment")

whereas neither of the following is true:

Stdev[Z] = p Stdev[X] + (1-p) Stdev[Y]
Var[Z] = p Var[X] + (1-p) Var[Y]

So it's mathematically super-convenient for summing across the search tree to just use the second raw moment (there is also a cost of some numeric precision here but that's no big deal in practice). But this has absolutely nothing to do with training. The neural net reports stdev, not the second raw moment, and the neural net never ever sees the second raw moment or anything like it - it's just a mathematical trick.
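
To make that concrete, here is a toy Python sketch of the bookkeeping (not KataGo's actual C++ search code; the weights and numbers are invented for the example):

```python
import math

# Hypothetical per-child stats accumulated during search: (weight, scoreMean, scoreMeanSq).
# Weights would normally be the children's visit fractions and sum to 1.
children = [
    (0.5,  3.0,  25.0),   # E[X] = 3,  E[X^2] = 25
    (0.3, -8.0,  70.0),   # E[Y] = -8, E[Y^2] = 70
    (0.2, 12.0, 150.0),
]

# Raw moments of the mixture are plain weighted sums of the children's raw moments.
mix_mean    = sum(w * m   for w, m, _   in children)
mix_mean_sq = sum(w * msq for w, _, msq in children)

# The stdev is only derived at the end, when it is actually wanted:
mix_stdev = math.sqrt(max(0.0, mix_mean_sq - mix_mean * mix_mean))
print(mix_mean, mix_mean_sq, mix_stdev)
```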

Second, there is a problem where the output value of score_stddev keeps growing during training. In my first Xavier-randomly-initialized model, scoreMean is -1.46 and scoreStdev is 14 on the empty board. According to the formula above, scoreMeanSq is already around 200. However, as training proceeds, scoreStdev gets bigger, so scoreMeanSq exceeds 10000.

Is the standard deviation of the score of your self-play game results as large as 100 points? With random play, a game can easily be won or lost with one player or the other controlling almost the whole board, just by chance. So if you are starting out learning from random play, a standard deviation in the 100s of points is correct, and you should be prepared for this by using an activation function capable of reporting such values, and by setting the scaling of that output so that producing such a value does not require an absurd internal activation strength within the neural net.
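
For example, here is a rough TensorFlow sketch of the kind of scaled output I mean (not the exact code in KataGo's model.py; the tensor names are placeholders):

```python
import numpy as np
import tensorflow as tf

# How large does the pre-activation need to be to report a stdev of 150 points?
# With softplus(x) * 20, softplus(x) only needs to reach 7.5, so x is about 7.5.
# With a bare softplus(x), x itself would have to reach about 150.
target_stdev = 150.0
x_scaled   = np.log(np.expm1(target_stdev / 20.0))  # inverse softplus, ~7.5
x_unscaled = np.log(np.expm1(target_stdev))         # ~150.0
print(x_scaled, x_unscaled)

# The corresponding output head (tensor name is a placeholder):
raw_head = tf.random.normal([8, 1])
score_stdev_prediction = tf.math.softplus(raw_head) * 20.0
```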

In the expectedWhiteScoreValue function, once scoreMeanSq grows beyond a certain amount, it always returns the same value. I think this is also a reason the net is not training. Do you have any opinions on this? Or is there any technique that lets scoreStdev train reliably?

I still don't understand why you are talking about scoreMeanSq in the context of the neural net. The neural net should never be dealing with this value; as I mentioned before, it's purely a mathematical trick for the search tree, so it doesn't make sense to bring it up when thinking about how the neural net learns. What does "always returns the same value" mean?

Lastly, not only the scoreStdev loss but also the other score-related losses, such as the pdf and cdf losses, tend to increase again after dropping at the beginning of training. I wonder if you have had this kind of experience. I have not yet applied SWA, which appears to be used in KataGo. I also excluded the moving average loss because training seemed to get worse with it. Will these two have a big impact on training? Any comments would be appreciated.

I'm not sure what the moving average loss you are talking about is - could you clarify? SWA of course should not have any effect on the losses you see; SWA literally does not affect the training in any way. I'm not exactly sure what would cause the loss pattern you observe, and unfortunately I don't know whether the loss in my own training run follows the same pattern either, since I no longer pay much attention to the beginning of training in KataGo - it proceeds so fast that even on 19x19, the bot is strong human amateur dan within the first few days without many GPUs (and in much less than 1 day with the normal cluster of GPUs that Kata has been using).
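
To illustrate why SWA can't affect the losses you log, here is a toy sketch (not KataGo's actual implementation): the averaged weights live entirely off to the side of the optimizer.

```python
import numpy as np

def train_step(weights):
    # Stand-in for a real optimizer update.
    return weights - 0.01 * np.random.randn(*weights.shape)

weights = np.zeros(10)
swa_weights = np.zeros_like(weights)
n_snapshots = 0

for step in range(1, 1001):
    weights = train_step(weights)   # the optimization itself never sees swa_weights
    if step % 100 == 0:             # periodically fold a snapshot into the running average
        n_snapshots += 1
        swa_weights += (weights - swa_weights) / n_snapshots

# swa_weights is only used when exporting/evaluating a model, never fed back into
# the optimizer, so it cannot change the training losses you log.
```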

(edit: fixed typos in math) (edit: a few more clarifications)

lightvector commented 4 years ago

There were certainly parts of your message that weren't clear to me. But let me know if that helps, or if I misunderstood anything you were saying; I'm happy to explain further. :)

lightvector commented 4 years ago

Maybe a further question if you're seeing training instability at the start of your run - have you tried adjusting the learning rate and/or looking at the magnitude of gradients and using gradient clipping?

Also, are there major architectural differences with your net? For example, I found that once I switched KataGo to no longer use batch norm, gradient clipping was needed to prevent instability, but other than that it learned quite fine.
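
In case it's useful, global-norm gradient clipping in TensorFlow looks roughly like this (a generic sketch, not KataGo's training loop; the model, optimizer, and loss function here are placeholders):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.02, momentum=0.9)

def train_step(model, inputs, targets, loss_fn, clip_norm=5.0):
    with tf.GradientTape() as tape:
        loss = loss_fn(targets, model(inputs, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Rescale the whole gradient vector if its global norm exceeds clip_norm.
    grads, global_norm = tf.clip_by_global_norm(grads, clip_norm)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss, global_norm   # logging global_norm helps pick a sensible clip_norm
```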

isseebx123 commented 4 years ago

First of all, thanks for the quick and friendly answer. Based on what you said, I spent some time running experiments, which delayed my reply. Sorry for the late response.

We could, but the point of scoreMeanSq is so that we are separately also able to accumulate the second moment of the score in the search tree as an interesting stat. In general, given two random distributions X and Y, if we let Z be the mixture distribution that is X with probability p and Y with probability 1-p, then:

E[Z] = p E[X] + (1-p) E[Y] ("first moment, i.e. mean")
E[Z^2] = p E[X^2] + (1-p) E[Y^2] ("second raw moment")

whereas neither of the following is true:

Stdev[Z] = p Stdev[X] + (1-p) Stdev[Y]
Var[Z] = p Var[X] + (1-p) Var[Y]

So it's mathematically super-convenient for summing across the search tree to just use the second raw moment (there is also a cost of some numeric precision here but that's no big deal in practice). But this has absolutely nothing to do with training. The neural net reports stdev, not the second raw moment, and the neural net never ever sees the second raw moment or anything like it - it's just a mathematical trick.

My mathematical knowledge is weak, but I will try to understand it. The reason I was initially fixated on scoreMeanSq was that I thought this value was used as-is when calculating the score utility. However, I confirmed that scoreMeanSq is converted to a stdev via the getScoreStdev function, so the question I had is mostly resolved.

Is the standard deviation of the score of your self-play game results as large as 100 points? With random play, a game can easily be won or lost with one player or the other controlling almost the whole board, just by chance. So if you are starting out learning from random play, a standard deviation in the 100s of points is correct, and you should be prepared for this by using an activation function capable of reporting such values, and by setting the scaling of that output so that producing such a value does not require an absurd internal activation strength within the neural net.

Yes. Since training starts from random play, self-play games with large standard deviations are being generated. I will report the actual value later. Is it correct to understand that the scaling you mention is, for example, adjusting the factor of 20 multiplied in the line below?

scorestdev_prediction = tf.math.softplus(miscvalues_output[:,1]) * 20.0

I still don't understand why you are talking about scoreMeanSq in the context of the neural net. The neural net should never be dealing with this value; as I mentioned before, it's purely a mathematical trick for the search tree, so it doesn't make sense to bring it up when thinking about how the neural net learns. What does "always returns the same value" mean?

Oh, this was because of a bug. As I said above, I was using the large scoreMeanSq value directly, without passing it through getScoreStdev, to compute the score utility, which meant that the same scoreUtility always came out for the same scoreMean. Or maybe my English was not good enough and the meaning got lost. Sorry :)

I'm not sure what the moving average loss you are talking about is - could you clarify? SWA of course should not have any effect on the losses you see; SWA literally does not affect the training in any way. I'm not exactly sure what would cause the loss pattern you observe, and unfortunately I don't know whether the loss in my own training run follows the same pattern either, since I no longer pay much attention to the beginning of training in KataGo - it proceeds so fast that even on 19x19, the bot is strong human amateur dan within the first few days without many GPUs (and in much less than 1 day with the normal cluster of GPUs that Kata has been using).

What I called the moving average loss is the moving average in the code below. I am not very familiar with this. Could you please explain how it works? Or is there a related paper or explanation somewhere?

Maybe a further question if you're seeing training instability at the start of your run - have you tried adjusting the learning rate and/or looking at the magnitude of gradients and using gradient clipping? Also, are there major architectural differences with your net? For example, I found that once I switched KataGo to no longer use batch norm, gradient clipping was needed to prevent instability, but other than that it learned quite fine.

The net I am experimenting with is different from KataGo; only the value, misc, and score head parts are the same. The learning rate has not been tuned much due to a lack of computing resources. I always watch the gradient values in TensorBoard, but I have never used gradient clipping. I will try it later. Thank you.

I have some additional questions. Why do you scale by scale_initial_weights when initializing weights in model.py? Is it to prevent problems from the L2 loss acting strongly early in training? Second, I know that the older papers and older code do not reuse the search tree in self-play; is that still true? One last question: is there a good way to check whether the score training is going well? I'm simply taking several games and measuring the Huber loss, roughly as in the sketch below.
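
Concretely, what I mean by measuring the Huber loss is roughly this (a toy sketch; the score arrays are just placeholders):

```python
import numpy as np

def huber(pred, target, delta=10.0):
    # Quadratic near zero, linear for large errors, so a few blowout games
    # do not dominate the average.
    err = np.abs(pred - target)
    return np.where(err <= delta, 0.5 * err ** 2, delta * (err - 0.5 * delta))

# Predicted final scores vs. actual final scores for a handful of positions
# (the numbers here are placeholders).
predicted_scores = np.array([3.5, -12.0, 40.0])
final_scores = np.array([6.5, -20.5, 15.0])
print(huber(predicted_scores, final_scores).mean())
```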

The story got longer. Thanks again for your kind answer.

(edit: fixed additional questions)

lightvector commented 4 years ago

Just realizing I never replied back here. Some brief answers to some of your questions:

isseebx123 commented 4 years ago

Thank you for your kind answer. It was very helpful. After experimenting with the advice you have given, I will ask for help again later. :)