lantonov / asmFish

A continuation of the nice project asmFish by Mohammed Li. Latest version: 07.08.2019
https://lantonov.github.io/asmFish/

Network asmFish can't compile #27

Closed. lantonov closed this issue 7 years ago

lantonov commented 7 years ago

I merged the network asmFish from the isolated repository https://github.com/tthsqe12/asm into branch 'network' of the main repository: https://github.com/lantonov/asmFish/tree/network. However, it fails to compile with the error:

     flat assembler  version 1.71.57  (50000 kilobytes memory)
     asmFish\guts/Uci.asm [116]:
     GD String, 'processing cmd line command: '
     processed: GD String,'processing cmd line command: '
     error: illegal instruction.

I checked whether this was an artifact of the merging process, but it is not: the files downloaded from the original repository https://github.com/tthsqe12/asm as a zip file and unzipped on the HD give the same error when trying to compile. I think some definition or macro upstream is missing.

tthsqe12 commented 7 years ago

Oh no, I didn't mean for this to be used yet. Be patient while I turn this into something that is actually useful.

lantonov commented 7 years ago

Thanks again. I would be glad to help with anything I can: GitHub, formatting, search, testing, etc.

Ipmanchess commented 7 years ago

Count me in for testing, if needed ;)

tthsqe12 commented 7 years ago

So I'm training a 256-128-64-1 net from Stockfish's evaluation function. The first two layers have a 'ramp' activation, and the last layer (which produces the output) has no activation function. After a couple of 'go depth 25' runs, the net is averaging 60 cp from Stockfish's evaluation. Every time evaluate is called, the net is trained, which seems to slow the engine down from 2 Mnps to 130 Knps. Not sure if this is the right approach.
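In rough Python terms the setup looks something like the sketch below (the layer sizes and the "train on every evaluate call" idea are as described above; the max(0, x) ramp, the initialization, the learning rate and the squared-error loss are only placeholders here, the real thing is in assembly and differs in details):

    import numpy as np

    rng = np.random.default_rng(0)

    # Layer sizes as above: 256 inputs -> 128 -> 64 -> 1 output
    sizes = [256, 128, 64, 1]
    W = [rng.normal(0.0, np.sqrt(2.0 / n_in), (n_out, n_in))
         for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    b = [np.zeros(n_out) for n_out in sizes[1:]]

    ramp = lambda z: np.maximum(0.0, z)    # assumed 'ramp' activation; the real one may differ

    def forward(x):
        """Activations of every layer; the last layer is linear (no activation)."""
        a = [x]
        for i, (Wi, bi) in enumerate(zip(W, b)):
            z = Wi @ a[-1] + bi
            a.append(ramp(z) if i < len(W) - 1 else z)
        return a

    def train_step(x, target_cp, lr=1e-4):
        """One online SGD step toward the reference (Stockfish) eval, in centipawns."""
        a = forward(x)
        err = a[-1][0] - target_cp          # squared-error loss: d(loss)/d(output) = err
        delta = np.array([err])
        for i in reversed(range(len(W))):
            gW, gb = np.outer(delta, a[i]), delta
            if i > 0:                       # backpropagate through the ramp
                delta = (W[i].T @ delta) * (a[i] > 0)
            W[i] -= lr * gW
            b[i] -= lr * gb
        return abs(err)

    # e.g. one step with a random 256-feature position and a +35 cp reference eval
    print(train_step(rng.random(256), 35.0))

Doing the forward pass plus this backward pass on every call to evaluate is what eats the speed: the forward pass alone is on the order of 40k multiply-adds per position, and training roughly triples that.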

Ipmanchess commented 7 years ago

I don't know if it's useful for you, but there are some interesting ongoing discussions on Talkchess about parallel speedup & deep learning (Deep Pink): http://www.talkchess.com/forum/viewforum.php?f=7 Would it be possible, for example, to do this training on a graphics card, using all of its cores (CUDA), so that asmFish still has full CPU power?

lantonov commented 7 years ago

Just to get on the same page: this is a network with 256 input nodes, 2 hidden layers (128 and 64 nodes) and an output layer (1 node), and the ramp functions act between the input layer and the first hidden layer and between the first and the second hidden layer. Is that right?

For the speed and effectiveness of training, the logical structuring of the input, especially with respect to the expected output, is of utmost importance. The GIGO problem in NNs is huge. The input can be structured in indefinitely many ways, some better than others; previously I listed some possible inputs, without any claim that those are the best. The type of activation function is also important. If by 'ramp' function is meant f(x) = max(0, x), it is the same as the Rectified Linear Unit (ReLU) mentioned above in the Matthew Lai approach. ReLU is preferred over the common logistic / tanh functions because of the following features, which allow faster and more efficient training of deep neural architectures on large and complex datasets (a small numerical sketch follows the two lists below):

  1. One-sided range [0, ∞), compared to the antisymmetric (-1, 1) range of tanh.
  2. Sparse activation: for example, in a randomly initialized network, only about 50% of the hidden units are activated (have a non-zero output).
  3. Efficient gradient propagation: no saturation, so vanishing gradients are much less of a problem.
  4. Efficient computation: only comparison, addition and multiplication.
  5. Scale-invariant: max(0, ax) = a·max(0, x) for a ≥ 0.

However, there are also potential problems with ReLU:

  1. Non-differentiable at zero; however, it is differentiable everywhere else, including points arbitrarily close to (but not equal to) zero.
  2. Non-zero-centered output.
  3. Unbounded: activations could potentially blow up.
  4. Dying ReLU problem for high learning rates: if all inputs put the ReLU on the flat side, there is no hope that the weights change at all, and the node is dead. A ReLU may be alive and then die when the gradient step for some input batch drives the weights to smaller values, making the pre-activations < 0 for all subsequent inputs. A large learning rate amplifies this problem.
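Here is the promised small numeric sketch of these points (ReLU versus tanh, scale invariance, and a 'dead' unit); this is only an illustration, not engine code:

    import numpy as np

    relu  = lambda z: np.maximum(0.0, z)
    drelu = lambda z: (z > 0).astype(float)

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(z))     # [0.  0.  0.  0.5 2. ]  one-sided, cheap to compute
    print(drelu(z))    # [0. 0. 0. 1. 1.]       gradient is 0 or 1, never a tiny saturated value
    print(np.tanh(z))  # squashed into (-1, 1); its gradient -> 0 for large |z|

    # Scale invariance: max(0, a*z) == a * max(0, z) for a >= 0
    a = 3.0
    assert np.allclose(relu(a * z), a * relu(z))

    # 'Dying ReLU': if one oversized gradient step drives w and b far negative,
    # then w*x + b < 0 for every non-negative input x, the gradient is 0 from
    # then on, and the unit never recovers.
    w, b = -5.0, -1.0
    x = np.random.rand(1000)                 # inputs in [0, 1)
    pre = w * x + b
    print(np.count_nonzero(drelu(pre)))      # 0 -> the unit is dead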

The slowdown of the NN in terms of Mnps during training is of no particular concern compared to the evaluation error, once you take into account that the NN does a static evaluation (input a position, output a value, once), while SF's evaluation at depth=25 is dynamic: it searches a tree 25 plies deep and internally crunches many intermediate evaluations of the resulting positions, by an approach not unlike the NN's forward and back propagation. Increasing NN speed scales linearly with processing power, while increasing depth requires exponentially more of it. As concerns GPUs, it has been shown both in theory and in practice that they can greatly speed up neural networks. Assembly may have problems with GPUs, however, because they differ from CPUs (different registers, different ALUs, different FPUs, etc.), and there may be portability problems (GPUs from different manufacturers).
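Back-of-the-envelope, using the figures quoted above and an assumed effective branching factor of about 2 for the pruned search (an assumption, not a measurement):

    import math

    b = 2.0                                    # assumed effective branching factor
    nps_fast, nps_slow = 2_000_000, 130_000    # figures quoted above
    slowdown = nps_fast / nps_slow             # ~15.4x fewer nodes per second

    # At fixed time, a ~15x node deficit costs roughly log_b(15) plies of depth
    print(round(slowdown, 1), round(math.log(slowdown, b), 1))   # 15.4, ~3.9 plies

So the training slowdown is worth roughly 4 plies of depth, while the static eval itself would have to make up for all the tactics that those plies would have found.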

Ipmanchess commented 7 years ago

Maybe interesting: http://www.cs.tau.ac.il/~wolf/papers/deepchess.pdf

lantonov commented 7 years ago

I was just reading it. A very complicated and slow NN, but we may arrive at it eventually, God forbid. The idea in the thread to use tablebases for training is very good, though: the TB eval (-1, 0, or 1) is as reliable as can be. Of course, TB positions would be in addition to opening, middlegame and early endgame positions.

tthsqe12 commented 7 years ago

So I originally put the kings on the same 64 inputs as the pawns, but I just didn't like that aesthetically. So now there are 320 inputs, with the queens combined into the rooks and the bishops. I don't like +1 and -1 inputs, so I might also try 640 inputs. [attached image: pic170206]
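In rough Python terms the 320-input layout is something like the sketch below (this is only an illustration of the idea of 5 piece-type planes of 64 squares, a queen lighting up both the rook and the bishop plane, and ±1 for the colour; the real code is in assembly and the details differ):

    import numpy as np

    # Hypothetical 320-input encoding: 5 planes x 64 squares,
    # planes = pawn, knight, bishop, rook, king; a queen sets
    # both the bishop and the rook plane.  White = +1, Black = -1.
    PLANES = {'P': [0], 'N': [1], 'B': [2], 'R': [3], 'Q': [2, 3], 'K': [4]}

    def encode_320(board):
        """board: dict square(0..63) -> piece letter, uppercase = white."""
        x = np.zeros(5 * 64)
        for sq, piece in board.items():
            sign = 1.0 if piece.isupper() else -1.0
            for plane in PLANES[piece.upper()]:
                x[plane * 64 + sq] = sign
        return x

    # The 640-input variant would keep separate planes per colour,
    # so every input is 0 or 1 instead of -1/0/+1.
    toy = {0: 'R', 4: 'K', 60: 'k', 63: 'r'}   # toy position, not a full board
    print(int(np.count_nonzero(encode_320(toy))))   # 4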

lantonov commented 7 years ago

I will study these some more. I just made a commit for the last SF change, "Simplify scale factor computation", which only removes Evaluate.asm:1703-1704. Can you please check it? Bench didn't change, as it should be.

lantonov commented 7 years ago

The activation function that you use is not a ramp; it is very close to the logistic function in its properties. The ramp function looks like a ramp [image], while the function f(x) = (|x| + x + 1) / (2(|x| + 1)) is a sigmoid [image; the same function over a larger domain: image], very similar to the logistic function [image].

The derivative of f(x) = (|x| + x + 1) / (2(|x| + 1)) is f'(x) = 1 / (2(|x| + 1)^2), and it looks like this [image], which is roughly similar to the derivative of the logistic, f'(x) = exp(x) / (exp(x) + 1)^2 [image].
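A quick numeric check of these formulas (NumPy, central finite difference; just an illustration):

    import numpy as np

    f        = lambda x: (np.abs(x) + x + 1) / (2 * (np.abs(x) + 1))  # the function above
    df       = lambda x: 1 / (2 * (np.abs(x) + 1) ** 2)               # its closed-form derivative
    logistic = lambda x: 1 / (1 + np.exp(-x))

    # Sigmoid shape: -> 0 for x -> -inf, 1/2 at 0, -> 1 for x -> +inf
    print(f(np.array([-1000.0, 0.0, 1000.0])))         # ~[0.0005, 0.5, 0.9995]

    # On a moderate range it stays close to the logistic curve
    x = np.linspace(-6, 6, 1001)
    print(round(float(np.max(np.abs(f(x) - logistic(x)))), 3))   # ~0.081

    # The closed-form derivative agrees with a central finite difference
    h = 1e-6
    num_df = (f(x + h) - f(x - h)) / (2 * h)
    print(bool(np.max(np.abs(num_df - df(x))) < 1e-5))           # True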

The first layer matrix shows some periodicity, which almost disappears in the second layer matrix and in the vector fed to the output. This, if anything, shows that the NN is working towards some transformation of the input.

tthsqe12 commented 7 years ago

Sorry, I accidentally put my logistic function (which I used for MNIST data) there instead of my ramp function. This is the corrected version [attached image]. After many games and training on each call to evaluate, the net output was still around 60 cp from SF's evaluation, so I can say the net is basically producing junk.

As for the latest patch, I do not like it, as the comment "remove one condition which can never be satisfied." is false. The one who made this patch failed to realize that the condition is sometimes true (about 5% of the time on bench), but when it is true, the result of the scale factor computation is multiplied by zero. It is this kind of carelessness that is worrying about the Stockfish project.

lantonov commented 7 years ago

f(x) = (|x| + 1) / ((|x| - x + 1)^2 + 1) is indeed a ramp function [image]. Unlike the usual ramp function f(x) = max(0, x), this function is smooth (it has a derivative on the whole domain). The derivative is (-2x|x| + |x|^3 + |x| - x^3 + 2x^2) / (2|x| (-x|x| + |x| + x^2 - x + 1)^2) [image].
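A quick numeric check of the ramp shape and of the smoothness at 0 (again only an illustration of the formula above):

    import numpy as np

    g = lambda x: (np.abs(x) + 1) / ((np.abs(x) - x + 1) ** 2 + 1)   # the smooth ramp above

    # Ramp shape: ~0 on the far left, exactly (x + 1) / 2 for x >= 0
    print(g(np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])))  # ~[0.00025, 0.2, 0.5, 1.0, 500.5]

    # Smooth at 0: left and right finite-difference slopes both approach 1/2
    h = 1e-6
    print((g(h) - g(0.0)) / h, (g(0.0) - g(-h)) / h)       # ~0.5  ~0.5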

Don't be discouraged by the non-convergence on the first try. I don't know of any NN that has succeeded from the start; all require many modifications and parameter tuning.

The last patch didn't change bench so I will revert it right away.

tthsqe12 commented 7 years ago

No, no, you don't need to revert it! It is actually a little bit faster. I was just saying that the comment was misleading.

lantonov commented 7 years ago

Sorry, I acted too quickly. Reinstating:

    git reset --hard HEAD^
    git push origin master -f

lantonov commented 7 years ago

The greatest impact on performance comes from the structure of the input layer. The board representation is very similar to the basic digit recognition example, where the digit images are coded as an 8x8 matrix; the squares of the board correspond exactly to such a matrix. The shade of a "pixel" / square can correspond to a piece, so on the input nodes (not more than ~70) you can put floating-point numbers like 0.0 (empty), 1.0 (pawn), 2.0 (knight), 3.0 (bishop), 4.0 (rook), 5.0 (queen), 6.0 (king). The only difference from image recognition is that this is a regressor, not a classifier. I think that 320 or 640 input nodes are too many: instead of giving the NN the board pattern, they confuse it with too many uninformative connections. Training should start not with games but by feeding boards to the input and reference evaluations to the output, to allow the network to form its proper weights through propagation.
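A rough sketch of what I mean, with the piece codes from the list above (the negative sign for Black, the FEN parsing and the shape of the training data are my own illustration, not a proposal for the actual assembly code):

    import numpy as np

    # Piece codes from the list above; negative for Black is an assumption.
    CODES = {'p': 1.0, 'n': 2.0, 'b': 3.0, 'r': 4.0, 'q': 5.0, 'k': 6.0}

    def encode_64(fen_board):
        """64 inputs, one per square, from the board part of a FEN string."""
        x, sq = np.zeros(64), 0
        for ch in fen_board:
            if ch == '/':
                continue
            if ch.isdigit():
                sq += int(ch)                 # run of empty squares
            else:
                code = CODES[ch.lower()]
                x[sq] = code if ch.isupper() else -code
                sq += 1
        return x

    # Supervised pre-training on (position, reference eval) pairs, no games played
    def pretrain(net_step, labelled_positions, epochs=10):
        for _ in range(epochs):
            for fen_board, ref_cp in labelled_positions:
                net_step(encode_64(fen_board), ref_cp)   # e.g. train_step from the earlier sketch

    start = 'rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR'
    print(int(np.count_nonzero(encode_64(start))))       # 32 occupied squares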