Oh no, I didn't mean for this to be used yet. Be patient while I turn this into something that is actually useful.
Thanks again. I would be glad to help with anything I can: GitHub, formatting, search, testing, etc.
Count me in for testing, if needed ;)
So I'm training a 256-128-64-1 net from Stockfish's evaluation function. The first two layers have a 'ramp' activation function, and the last layer (which produces the output) has no activation function. After a couple of 'go depth 25', the net is averaging 60cp away from Stockfish's evaluation. Every time evaluate is called, the net is trained, which seems to slow it down from 2 Mnps to 130 Knps. Not sure if this is the right approach.
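For concreteness, a minimal NumPy sketch of the setup described above (the layer sizes and the linear output follow the comment; the learning rate, initialization and function names are illustrative assumptions, not the actual asmFish code, which is written in assembly):

```python
# Minimal sketch of a 256-128-64-1 net trained online against a reference
# evaluation in centipawns.  Illustrative only, not the asmFish implementation.
import numpy as np

rng = np.random.default_rng(0)
sizes = [256, 128, 64, 1]
W = [rng.normal(0.0, np.sqrt(2.0 / m), (n, m)) for m, n in zip(sizes, sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]

def ramp(x):
    """The 'ramp' (ReLU-like) activation used on the two hidden layers."""
    return np.maximum(0.0, x)

def forward(x):
    acts = [x]
    for i, (Wi, bi) in enumerate(zip(W, b)):
        z = Wi @ acts[-1] + bi
        acts.append(ramp(z) if i < len(W) - 1 else z)   # linear output layer
    return acts

def train_step(x, target_cp, lr=1e-4):
    """One online gradient step toward the reference (e.g. Stockfish) eval."""
    acts = forward(x)
    delta = acts[-1] - target_cp                  # gradient of 0.5*(out-target)^2
    for i in reversed(range(len(W))):
        grad_W, grad_b = np.outer(delta, acts[i]), delta
        if i > 0:
            delta = (W[i].T @ delta) * (acts[i] > 0)    # ramp derivative
        W[i] -= lr * grad_W
        b[i] -= lr * grad_b
    return float(acts[-1][0])                     # the net's evaluation in cp
```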
I don't know if it's useful for you... there are some interesting discussions on Talkchess about parallel speedup & deep learning (Deep Pink): http://www.talkchess.com/forum/viewforum.php?f=7 Would it be possible, for example, to offload this training to a graphics card, using all of its cores (CUDA), so that asmFish still has the full CPU power?
Just to get on the same page: this is a network with 256 input nodes, 2 hidden layers (128 nodes and 64 nodes) and an output layer (1 node). The ramp functions act between the input layer and the first hidden layer and between the first hidden layer and the second hidden layer. Is that right?

For the speed and effectiveness of training, the logical structuring of the input, especially with respect to the expected output, is of utmost importance. The GIGO problem in NN is huge. Input can be structured in indefinitely many ways, some better than others. Previously, I listed some possible inputs, without any claim that those are the best, however. The type of activation function is also important. If by 'ramp' function is understood the function f(x) = max(0,x), it is the same as the Rectified Linear Unit (ReLU) mentioned above in the Matthew Lai approach. ReLU is preferred over the common logistic / tanh functions because it allows faster and more efficient training of deep neural architectures on large and complex datasets: it is cheap to compute, it gives sparse activations, and its gradient does not vanish for large positive inputs.
However, there are also potential problems with ReLU: units can 'die' (get stuck at zero output when their input stays negative), and the output is unbounded.
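To make the gradient part of this concrete, a small numerical illustration (plain NumPy; the sample points are arbitrary):

```python
# Why ReLU trains deep nets faster than the logistic: the logistic derivative
# vanishes for large |x|, while ReLU's derivative is exactly 1 for any x > 0
# (and exactly 0 for x < 0, which is also where the 'dying ReLU' problem lives).
import numpy as np

x = np.array([-8.0, -2.0, 0.5, 2.0, 8.0])

logistic = 1.0 / (1.0 + np.exp(-x))
d_logistic = logistic * (1.0 - logistic)   # ≈ [3.4e-04 1.0e-01 2.4e-01 1.0e-01 3.4e-04]

d_relu = (x > 0).astype(float)             # [0. 0. 1. 1. 1.]
```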
The slowdown of the NN in terms of Mnps during training is of no particular concern compared to the evaluation error, when one takes into account that the NN does static evaluation (input a position, output a value, one pass) while SF evaluation at depth=25 is dynamic: it searches a tree 25 plies (layers) deep and internally crunches many intermediate evaluations of the resulting positions with an approach essentially similar to a NN (forward and back propagation). Increasing NN speed is linear, while increasing depth is exponential with respect to processing power.

As concerns GPUs, it has been shown both in theory and in practice that they can greatly speed up neural networks. Assembly may have problems with GPUs, however, because they are different from CPUs (different registers, different ALU, different FPU, etc.). There may also be problems with portability (GPUs from different manufacturers).
Maybe interesting: http://www.cs.tau.ac.il/~wolf/papers/deepchess.pdf
I was just reading it. A very complicated and slow NN, but we may arrive at it eventually, God forbid. The idea in the thread to use tablebases for training is very good, though. TB eval (-1, 0, or 1) is as reliable as can be. Of course, TB positions would be in addition to opening, middlegame and early endgame positions.
So I originally put kings on the same 64 inputs as pawns, but I just didn't like that aesthetically. So now there are 320 inputs, with queens combined with the rooks and bishops. I don't like +1 and -1 inputs, so I might try 640 inputs as well.
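One possible reading of these two encodings, as a hypothetical sketch (the square/group indexing and the tuple format are made up for the example):

```python
# Sketch of the two input layouts discussed above (indexing is illustrative):
#  - 320 inputs: 5 piece groups x 64 squares, +1 for a white piece, -1 for black
#  - 640 inputs: 10 planes (5 groups x 2 colours), values 0/1 only
# Piece groups: 0=pawn, 1=knight, 2=bishop, 3=rook, 4=king; a queen would be
# passed as two entries (bishop group and rook group), i.e. "queens combined
# with rooks and bishops".
import numpy as np

def encode_320(pieces):
    """pieces: iterable of (square 0..63, group 0..4, is_white) tuples."""
    x = np.zeros(320)
    for sq, group, is_white in pieces:
        x[group * 64 + sq] = 1.0 if is_white else -1.0
    return x

def encode_640(pieces):
    x = np.zeros(640)
    for sq, group, is_white in pieces:
        plane = 2 * group + (0 if is_white else 1)
        x[plane * 64 + sq] = 1.0
    return x

# e.g. a white king on e1 (square 4) and a black pawn on e5 (square 36)
x = encode_640([(4, 4, True), (36, 0, False)])
```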
I will study these some more. Just made a commit for the last SF change "Simplify scale factor computation", with the only change being the removal of Evaluate.asm:1703-1704. Can you please check it? Bench didn't change, as it should be.
The activation function that you use is not a ramp but is very close to the logistic in properties. The ramp function looks like a ramp, while the function f(x) = (|x|+x+1)/(2(|x|+1)) is a sigmoid, very similar to the logistic function over a larger domain.

The derivative of f(x) = (|x|+x+1)/(2(|x|+1)) is f'(x) = 1/(2(|x|+1)^2), which is roughly similar to the derivative of the logistic, f'(x) = exp(x)/(exp(x)+1)^2.
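A quick numerical check (NumPy) of the formulas quoted above; the grid of sample points and the finite-difference step are arbitrary choices:

```python
# f(x) = (|x|+x+1) / (2(|x|+1)) compared with the logistic, and its derivative
# f'(x) = 1 / (2(|x|+1)^2) checked against a central finite difference.
import numpy as np

def f(x):
    return (np.abs(x) + x + 1.0) / (2.0 * (np.abs(x) + 1.0))

def df(x):
    return 1.0 / (2.0 * (np.abs(x) + 1.0) ** 2)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
print(np.max(np.abs(f(x) - logistic(x))))   # the curves differ by < 0.1 on this range
h = 1e-6
print(np.max(np.abs(df(x) - (f(x + h) - f(x - h)) / (2 * h))))  # tiny: formula matches
```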
The first layer matrix shows some periodicity which almost disappears in the second layer matrix and the vector fed to the output. This, if anything, shows that the NN is working towards some transformation of the input.
Sorry, I accidentally put my logistic function (which I used for MNIST data) there instead of my ramp function. This is the corrected version: f(x) = (|x|+1)/((|x|-x+1)^2+1). After many games and training on each call to evaluate, the net output was still around 60cp away from SF's evaluation, so I can say the net is basically producing junk.
As for the latest patch, I do not like it, as the comment "remove one condition which can never be satisfied" is false. The one who made this patch failed to realize that the condition is sometimes true (about 5% of the time on bench), but when it is true the result of the scale factor computation is multiplied by zero. It is this kind of carelessness that is worrying about the Stockfish project.
f(x) = (|x|+1)/((|x|-x+1)^2+1) is indeed a ramp function. Unlike the usual ramp function f(x) = max(0,x), the above function is smooth (it has a derivative on the whole domain). The derivative is f'(x) = (-2x|x| + |x|^3 + |x| - x^3 + 2x^2) / (2|x| (-x|x| + |x| + x^2 - x + 1)^2).
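And the corresponding check for the smooth ramp (x = 0 is excluded from the grid because the closed-form derivative divides by |x| there; the two one-sided limits are both 1/2):

```python
# The smooth ramp f(x) = (|x|+1) / ((|x|-x+1)^2 + 1) and its derivative,
# verified against a central finite difference.  For large positive x the
# function grows like x/2; for large negative x it decays to 0.
import numpy as np

def f(x):
    return (np.abs(x) + 1.0) / ((np.abs(x) - x + 1.0) ** 2 + 1.0)

def df(x):
    a = np.abs(x)
    num = -2.0 * x * a + a ** 3 + a - x ** 3 + 2.0 * x ** 2
    den = 2.0 * a * (-x * a + a + x ** 2 - x + 1.0) ** 2
    return num / den

x = np.array([-4.0, -1.0, -0.1, 0.1, 1.0, 4.0])
h = 1e-6
print(np.max(np.abs(df(x) - (f(x + h) - f(x - h)) / (2.0 * h))))  # tiny: formula matches
print(f(np.array([5.0, 50.0])))   # ≈ [3.0 25.5], roughly x/2 for large positive x
```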
Don't be discouraged by the non-convergence on the first try. I don't know of any NN that has succeeded from the start. All require many modifications and much parameter tuning.
The last patch didn't change bench so I will revert it right away.
No no, you don't need to revert it! It is actually a little bit faster. I was just saying the comment was misleading.
Sorry, I acted too quickly. Reinstating.
git reset --hard HEAD^
git push origin master -f
The greatest impact on performance comes from the structure of the input layer. The board representation is very similar to the basic example of digit recognition, where the digit images are coded as an 8x8 matrix. The squares of the board correspond exactly to such a matrix. The shade of the "pixels" / squares can correspond to pieces, and on the input nodes (not more than 70) you can put floating point numbers like 0. (empty), 1. (pawn), 2. (knight), 3. (bishop), 4. (rook), 5. (queen), 6. (king). The only difference from image recognition is that this is a regressor and not a classifier.

I think that 320 or 640 input nodes are too many; instead of giving the NN the board pattern, they confuse it with too many uninformative connections. Training should start not with games but with feeding boards to the input and reference evaluations to the output, to allow the network to create its proper weights through propagation.
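A minimal sketch of that 64-input, image-like encoding (the piece-to-value map follows the comment above; the use of sign for colour and the FEN-style board string are my own illustrative choices):

```python
# One float per square, 0.=empty, 1.=pawn, 2.=knight, 3.=bishop, 4.=rook,
# 5.=queen, 6.=king; the sign (+ white / - black) is one possible convention
# and was not specified in the comment above.
import numpy as np

PIECE_VALUE = {'p': 1.0, 'n': 2.0, 'b': 3.0, 'r': 4.0, 'q': 5.0, 'k': 6.0}

def board_to_input(fen_board):
    """fen_board: the piece-placement field of a FEN string."""
    x = np.zeros(64)
    for rank, row in enumerate(fen_board.split('/')):
        file = 0
        for ch in row:
            if ch.isdigit():
                file += int(ch)                       # run of empty squares
            else:
                sign = 1.0 if ch.isupper() else -1.0  # upper case = white in FEN
                x[rank * 8 + file] = sign * PIECE_VALUE[ch.lower()]
                file += 1
    return x

x = board_to_input('rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR')  # start position
```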
I merged the network version of asmFish from the isolated repository https://github.com/tthsqe12/asm into branch 'network' of the main repository: https://github.com/lantonov/asmFish/tree/network. However, it could not compile, giving the error:
I checked whether this was an artifact of the merging process, but it is not: files downloaded from the original repository https://github.com/tthsqe12/asm as a zip file and unzipped on the HD give the same error when compiling. I think some definition or macro upstream is missing.