Regression Network with Multiple Outputs and UCI Fast Reader

microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

https://docs.microsoft.com/cognitive-toolkit/

Other

17.5k stars 4.29k forks source link

Regression Network with Multiple Outputs and UCI Fast Reader #97

Closed amirbegan closed 8 years ago

amirbegan commented 8 years ago

Hi, I am unable to get a regression network with multiple outputs to work using UCIFastReader. Is this possible?

If I define a network with 2 output nodes like: SimpleNetworkBuilder = [ layerSizes = 12:2 ...

and a UCIFastReader reader labels section as: labels = [ labelType = "regression" dim = 2 start = 12 labelMappingFile = "$DataDir$/label-mapping.txt" ]

...where label-mapping.txt is an empty file.

When I run the train command it ends with the following: EXCEPTION occurred: NotifyFunctionValuesMBSizeModified: labels InputValue operation had its row dimension 2 changed by the reader to 1.

If I change config to a network with one output node and change the reader section to dim = 1, then it works, so I think this somehow has to do with not being able to use the UCIFastReader for regression with multiple outputs, if that is the case - is there a way to do this with some different reader?

dongyu888 commented 8 years ago

I have pushed a change at branch dongyu/UCIFastReaderFix. can you check whether it addressed your problem?

Be noted that in the regression mode you don't need the label mapping file and it has no effect if it's there.

amirbegan commented 8 years ago

Thanks for the change.

I've tried the branch, the error is gone, but it does not work as I expect; on a simple test case with 1 input and 2 outputs, when I write out the predictions both outputs have very similar values while I expect the two outputs to be different significantly. It is as if both outputs have actually been trained on the same column of data rather than two different columns.

As for the labelMappingFile and regression, I get the following exception if I do not specify the labelMappingFile value for the reader: EXCEPTION occurred: configparameters: required parameter missing: SimpleWithOneAmir.config:SimpleWithOne_Demo_Train:reader:labels:labelMappingFile

dongyu888 commented 8 years ago

can you send me your setup to reproduce your problem? at least for the label mapping it should work.

Thanks, Dong Yu (俞栋)Sent from my smart phone. Please forgive my typos.

-------- Original message --------

From: amirbegan

Date: 2/18/2016 12:11 (GMT-08:00)

To: Microsoft/CNTK

Cc: Dong Yu

Subject: Re: [CNTK] Regression Network with Multiple Outputs and UCI Fast Reader (#97)

Thanks for the change.

Reply to this email directly or view it on GitHub: https://github.com/Microsoft/CNTK/issues/97#issuecomment-185895288

amirbegan commented 8 years ago

Here is the setup, let me know if you need anything else.

auto-mpg-02.zip

dongyu888 commented 8 years ago

I ran your setup and did not observe the labelmappingfile issue. I tried removing lablemappingfile and labelDim (labelDim is only valid if you have categorical label in which case you can only use one column as the label in the csv file).

I did notice a bug when copying data from the parser to input nodes when there are multiple regression label columns. This is now fixed. Can you please try dongyu/UCIFastReaderFix again?

amirbegan commented 8 years ago

Tried the fix, works properly on my test case.

Also, as you mentioned, works fine if I do not specify lableMappingFile and labelDim.

Thank you.

jsboige commented 8 years ago

Hi, I don't have the possibility to build the corresponding branch atm. Considering I'm also trying to perform a multidimensional regression and have the same "dimension x changed by the reader" exception, should I assume that won't work until next release? Any other option or a possibility to get my hands on a Build? Thanks in advance

frankseide commented 8 years ago

The fix was done 22 days ago. This is likely in the latest binary already. @dongyu888, could you confirm?

dongyu888 commented 8 years ago

Actually I don’t know. Alexey, do you know?

jsboige commented 8 years ago

well the latest Binary is older than the fix is that right? Anyway, thanks for your quick answer. Maybe you can review my setup: I'm trying to make a variation of that model in TF to evaluate Board Go positions for end game territories.

I'm in c# so I wrote an alternate script to produce my own data. My training tile has game records with BoardSize = 19*19 = 361 columns for end game territories label vector ( values -1,0,1) and another 361 feature vector for mid game position ( values -1,0,1) for so my reader's:

 features = [
            dim = 361
            start = 361
        ]

        labels = [
            dim = 361
            start = 0
          #  labelDim = 3
            labeltype = regression
            labelMappingFile = "$DataDir$/labelsmap.txt"
        ]

If I comment the last line I get "parameter missing (...) labelmappingFile"

If I don't I get

EXCEPTION occurred: NotifyFunctionValuesMBSizeModified: features InputValue operation had its row dimension 784 changed by the reader to 361.

Now I'm also not very sure with my arrangement of convoluted layers and how to best re-project into the original label with an image dimension while preserving locality. Of course, knowing if there's a bug around would help to start with, but if you've got advises on how to handle that part it would be very nice. Will I get the same error if I get my dimensions wrong?

I started from the mnist example, with:

labels = ImageInput(imageW, imageH, 1, tag = label, imageLayout=$imageLayout$) I commented pooling to start with, and I'll add maps when I get this first one right, now I suppose the end layers should look like :

    h1Dim = 361
    # DNNSigmoidLayer and DNNLayer are defined in Macros.ndl
    # h1 = DNNSigmoidLayer(512, h1Dim, pool2, 1)
    h1 = DNNSigmoidLayer(zzz, h1Dim, conv2_act, 1)
    ol = DNNLayer(h1Dim, 361, h1, 1)

    ce = CrossEntropyWithSoftmax(labels, ol)
    err = ErrorPrediction(labels, ol)

Is that right? To get zzz, should I apply eq. (2.156) from the book and the output layer of the successive convoluted layers? Also, what should h1Dim be? Having comments about that computation in the code would be nice. Would you walk me to the appropriate ndl ?

Thanks in advance

dongyu888 commented 8 years ago

You may need to build the current master branch.

jsboige commented 8 years ago

My development environment is a virtual machine (no GPU) with only Vs 2015 installed and I wasn't able to build from source I'm sorry (many error messages starting from the Math project). Would you help me out with that?

frankseide commented 8 years ago

You may not need it (yet). The error in NotifyFunctionValuesMBSizeModified() does actually not indicate a problem with your tensor dimensions. Rather, this error is triggered when an InputValue node (called "features" in your case) was declared and used by the network, but not filled by the reader. I will refine that error message.

So we must first find out why the reader is not filling the "features." First, make sure that your reader definition contains sections for both 'labels' and 'features', e.g. like 01_OneHidden.cntk. These section names are matched with node names.

Next, I notice that you are reading the images into 'labels' instead. You should read them into 'features'. Please try to change your "labels=" into "features=", and also change the tag to "features" (or alternatively say FeatureNodes=(features)".

For training, you still need labels. Please try to define an InputValue named 'labels', like the MNIST sample 01_OneHidden.ndl (in the section called ndlMnistMacros). Please don't, however, copy the definition of "features" from there, keep using your InputImage.

The second argument to DNNLayer is the label dimension. For MNIST, that should be 10, not 361. Sorry, the layers take their dimension parameters in the order "inputDim, outputDim", which is reverse to the matrix dimension (rows=outputDim, cols=inputDim). It confuses me too sometimes.

Your NDL seems fine otherwise. To set zzz, I would go the simple route, let CNTK give you an error message and get it from there. I.e., leave it at 512 (or anything), run CNTK, get it to the point where it fails due to a dimension mismatch in a matrix product (:Times operation") in this layer (the error message will tell you a node name that will indicate whether it is indeed the node in this DNNSigmoidLayer that is failing). The error message will also tell you the tensor dimensions of both inputs of the Times node, something like [N x 512] * [W x H x C x ] --> [N x ]. The values for (W,H,C) that are shown are the actual dimension of the data that the weight matrix of the DNNSigmoidLayer is applied to, and you need to modify zzz to match that.

I notice one error. The OneHidden sample may be outdated, as the weight matrix really must be a tensor (we recently changed that). So you will need to change the DNNSigmoidLayer to a DNNImageSigmoidLayer, which is defined in the Macros.ndl file in the same folder with this signature:

DNNImageSigmoidLayer(inW, inH, inC, outDim, x, parmScale) = ...

You need to set inW/H/C to match the dimensions of the second argument in the error message you will get for the Times operation. (Again, the parameter order is even more confusing, since W,H,C are inputs. The tensor dimension will be [outDim x inW x inH x inC].)

(I will also track down why this was not caught in automated tests.)

h1Dim is a design question. The MNIST sample 01_OneHidden uses 200. In the end, you will use larger training sets, so you would want to increase this and also the depth.

If you get errors related to convolution engines (a typical error), the imageLayout must be set correctly, that is, run on GPU (deviceId>=0) and set imageLayout="cudnn" in all convolution-related nodes. A way to know which nodes must have that parameter is to search through the example NDL files in the MNIST folder for the string 'imageLayout'.

Since this is your first setup, expect to run into more errors. Don't hesitate to ask (maybe open a new Issue if the topic changes).

frankseide commented 8 years ago

Yikes, re-reading your mail, I realize the error indicates something slightly different.

row dimension 784 changed by the reader to 361

means the reader thinks the features node has dimension 361 (you indeed say so in your reader definition), while the features node itself is declared to be 784-dimensional.

But you declared your labels to be 19 x 19 (I presume that is correct in your scenario). Did you also declare the features to be 19 x 19? 784 is the value I see in the MNIST sample (=28 x 28). Please also adapt the definition of the features node. It is not enough to adapt it in the reader. Rather, reader and node dimensions must match.

jsboige commented 8 years ago

Thanks a lot for your thorough explanation. I'm not there yet, but I somehow managed to move on a bit I believe. Several points then:

I was able to build the latest source (Debug_CPUOnly) on a different environment, though the resulting binaries wouldn't work once copied to my development VM environment. When trying on the new environment, I did not get the error mentioning the missing labelmappingFile with that line commented in the reader section, so there seems to be a significant difference between release binaries and the build.
As per your advices, I have updated the macro.ndl file from the latest MNIST model source version and turned the DNNSigmoidLayer from the last layer to a DNNImageSigmoidLayer
From your 2nd message, I think you got that both my features and my labels are 1 channel maps with the size as the board 19 x 19 = 361, so the training file has rows with 361 *2 = 722 columns, and the features definition, which I ommited in my initial message, is also features = ImageInput(imageW, imageH, 1, tag = feature, imageLayout=$imageLayout$)
Somehow, the NotifyFunctionValuesMBSizeModified did vanish on both environment, I got to Times "operation error" to fix the product dimension as per your advice, which I did with h1 = DNNImageSigmoidLayer(11, 11, cMap2, h1Dim, conv2_act, 1)

I suppose that means that both convolutions with 5*5 kernels without stride applied from the original MNIST convolution example reduced the dimension by 5-1=4 each, which is consistent with eq. (2.156). On that note, the original article mentions an addition feature

final plane is all 1's so the convolutions can detect the edge of the board

I noticed the "zeroPadding" parameter in the convolution function, would I be able to keep the 19*19 dimension using the board detection plane trick, with that parameter?

Anyway I'm not there yet, and just trying to get something with 1 feature plane and similar 1 label plane for now.

So here is where I'm stuck now.

In my dev environment with the release binaries, cntk crashes with:

(...)
command: train test
precision = float
CNTKModelPath: ../Output/Models/02_Convolution
CNTKCommandTrainInfo: train : 5
CNTKCommandTrainInfo: CNTKNoMoreCommands_Total : 5
CNTKCommandTrainBegin: train
NDLBuilder Using CPU
Reading UCI file ../Data/Train-1F.txt

[CALL STACK]
    >00007FF7FB00C453 (SymFromAddr error : Tentative dÔÇÖacc├¿s ├á une adresse n
on valide.)
(...)
    -00007FF7FB043768 (SymFromAddr error : Tentative dÔÇÖacc├¿s ├á une adresse n
on valide.)

EXCEPTION occurred: ImageParameter should have 3 parameters [width, height, numC
hannels].
(...)

On the new environment with the source build, the same training file can be read properly it seems, and the model validated, but I've got a crash a bit later with:

Starting Epoch 1: learning rate per sample = 0.005000  effective momentum = 0.000000  momentum as time constant = 0.0 samples
UCIFastReader: Starting at epoch 0, counting lines to determine record count...
 117 records found.
starting epoch 0 at record count 0, and file position 0
already there from last epoch

Starting minibatch loop.

About to throw exception 'Features matrix not found in config file, there should be a section 'features=[...]' in the configuration file.'

EXCEPTION occurred: Features matrix not found in config file, there should be a section 'features=[...]' in the configuration file.

[CALL STACK]
    > Microsoft::MSR::CNTK::UCIFastReader<float>::  GetMinibatchImpl
    - Microsoft::MSR::CNTK::UCIFastReader<float>::  GetMinibatch
    - Microsoft::MSR::CNTK::DataReader::  GetMinibatch
    - Microsoft::MSR::CNTK::DataReaderHelpers::GetMinibatchIntoNetwork<float>
    - Microsoft::MSR::CNTK::SGD<float>::  TrainOneEpoch
    - Microsoft::MSR::CNTK::SGD<float>::  TrainOrAdaptModel
    - Microsoft::MSR::CNTK::SGD<float>::  Train
    - DoTrain<Microsoft::MSR::CNTK::ConfigParameters,float>
    - DoCommands<float>
    - wmainOldCNTKConfig
    - wmain1
    - wmain
    - __tmainCRTStartup
    - wmainCRTStartup
    - BaseThreadInitThunk
    - RtlUserThreadStart

I'm not sure what I did wrong then, considering all we've said.

Thanks in advance for any follow up. And of course if you're up for it, I can generate and send you training files together with my configuration.

frankseide commented 8 years ago

Glad to see we are quite a bit further!

I think the crash in UCIFastReader was a known out-of-bounds access in the randomization code in case the specified epochSize parameter is larger than the actual training set. The latest code has a workaround, but that might cause slightly suboptimal randomization. So even if it passes now, please could you ensure that your SGD epochSize parameter matches the training set.

As for the frame of '1', did I understand right, you want to add a second 19 x 19 plane (like a color channel) with 1s in the boundary positions, and ask whether the use of zero padding will retain the 19 x 19 dimension? Yes, it will.

Lastly, that error string about the missing "Features matrix" is wrong (just fixed it, but not yet in master). If I read the code right, it rather indicates that your reader config is fine but there is no feature node named "features" in your network. A feature node is one tagged as "feature", either by saying InputValue(..., tag = "feature", ..) or saying something like this in the NDL:

    FeatureNodes = (features)
    LabelNodes = (labels)

Could you check whether your "features" node matches the name used in the reader config and is tagged as a feature?

jsboige commented 8 years ago

Thanks again for your renewed help.

I'm yet a little bit ahead, I was using featScaled= features (without parenthesis), taping directly into features seemed to do the trick

I regenerated a slightly larger dataset with a little more than 450 pro games for training and 150 for testing. I have set the epoch size to 450 and epoch 1 runs on the build env (not on the dev environment with release binaries and the same reader error), but with 100% error, and I get a crash at the beginning of epoch 2:

cntk_exception

Here is what I have in Vs when attaching to the process before hitting retry:

cntk-go-debug

For reference, here are my training and test datasets

Train-1F.txt Test-1F.txt

And here both cntk and ndl models.

02_Convolution.cntk.txt 02_Convolution.ndl.txt

Now I didn't think past the dimensions at this point, and I suspect the EvalErrPerSample = 1 means there is still something wrong with the end layers, which may lead to the crash. Could it come from the -1/0/1 values?
Finally about the edge thing. 0 stands for an empty cell on the board, so I suppose that zero padding, while retaining original size, will lead to implementing an edge less board a bit akin to torus Go, which really is a different game. I suppose adding a map with 1s covering the board serves marking the edges as walls rather than hidden empty cells.

jsboige commented 8 years ago

Also, on that last remark about exploiting the board geometry: ideally, on top of making sure the edges are properly accounted for, I'd want to make sure:

Symmetry invariance by 90° rotation is accounted for: would that involve generating symmetric samples in the dataset, or replicating features into additional maps with such rotations applied?
Information is propagated along vertical and horizontal connections only (Manhattan distance). Unlike in the game of life, there are not diagonal connections. I suppose that would mean removing kernel corners in convolution layers with 0 values. Is there any way to enforce such a mask during training?

Finally in the conclusion of the original example, the author mentions that implementing connectedness would be interesting: indeed, chains of connected stones of the same color (along vertical and horizontal lines) share the same status, which it seems was not properly learnt by the yet rather successful end model. I suppose having some kind of "chain pooling" as suggested would be nice indeed, what would be a good way to go about it.

BTW, if I get anywhere with that model, I'll be happy to contribute the final setup for instance if you want to introduce a Games examples section.

frankseide commented 8 years ago

Let me first reply to your last post. The ImageReader has abilities to crop and scale, but I don't think it can rotate. UCIFastReader has no such facilities. You would have to generate the data yourself.

As for filtering with masked corners, I think you should be able to use a convolution filter that is an element-wise product of a learnable parameter and a constant tensor, something like

maskedKernel = ElementTimes (kernel, mask)
out = Convolution (maskedKernel, ...).

To do that, you would have to:

create such a mask as a file
load it into a ImageParameter (..., init="fromFile", initFromFilePath="PATH_OF_THAT_FILE", learningRateMultiplier=0), where the last option makes it a constant (the predefined Constant() only works for scalars)

How to create the mask file: Let me assume that the mask has dimensions [W x H x C] but is identical for all C (if not, then it is a bit more complex). In that case, set C=1 and let ElementTimes broadcast it for you. Then you can just print out your W x H mask to an ASCII file. E.g. for a 3 x 3 filter with suppressed corners, you'd need 3 lines, one per matrix row, each with 3 values separated by whitespace:

0 1 0
1 1 1
0 1 0

Lastly, if you use Linux, you should be able to avoid writing a separate file by using a Perl-like inline pipeline command like this (I might have gotten the escaping wrong):

initFromFilePath="echo -e '0 1 0\n1 1 1\n0 1 0' |"

Under Windows, this works with Cygwin.

frankseide commented 8 years ago

As for the UCIFastReader crash, I noticed that your training set seems to have 458 samples, not 450. Please try to set epochSize to 458. Sorry, trying to work around that bug.

frankseide commented 8 years ago

Looking at your NDL, you might need to change your criterion. The CE criterion optimizes for classification and implies that you are telling it to predict one of 361 using a one-out-of-361 Softmax. Your label must be one-hot for that.

What is the objective you want to optimize?

jsboige commented 8 years ago

Thanks again, this is really useful.

Fixing the epoch number did work around the crash indeed.
And you're right I need to figure out my regression criterion. My objective is to minimize the difference between the desired board consisting of a 19*19=361 col label vector, and the output board from the 361 col output vector. I guess the equivalent of the loss function defined in the original article, could be attained with the SquareError function, is that right? Changing the CE criterion for that function did not seem to work though with 100% error again. Do you have examples for similar regressions?
The kernel mask file solution seems just what I need, thanks. I will look into it.
As for rotations, I suppose I can handle pre-processing the data. Again, I'm not sure whether the best solution is simply to generate distinct rotated samples, and let the training figure out the invariance, or to actually introduce additional rotated feature maps, with shared convolution weights to enforce the invariance, rotated through a matrix product into derived parameters. Is that possible to start with, and desirable then?
I realize this is probably also the solution for "chain pooling": Just like the original author divided the input boards into several maps to discriminate stone cells according to their immediate neighbor numbers, I could similarly introduce additional maps to discriminate the corresponding chains according to the number of connected stones they contain. However, I'm not sure whether increasing the number of significant feature planes this way will end up getting into the way of convergence at some point rather than making it easier, added to the fact that if I need to combine model evaluation with Monte-Carlo or iterated Alpha Beta search at some point, the additional pre-processing might have a significant penalty. Maybe you have an opinion about it.

frankseide commented 8 years ago

As for rotations, you probably also want mirroring? How about inversion (Black <-> White)? You'd be increasing your training data 16-fold.

Now that I think about it, technically you could describe rotation, mirroring, and inversion without extending your data. The idea would be that rotating or mirroring your data is equivalent to rotating or mirroring your kernels, if you connect the layers correctly. Rotated or mirrored kernels (or any permutation) can be created through matrix products on flattened tensors that can be back-propagated through. However, one reason why extending your data is a better solution is that it gives you better randomization (since the 16 variants will not always show up in the same minibatch).

The loss function can be achieved with SquareError() indeed. Alternatively, you can also say SumElements ((a - b) .* (a - b)) (for actual NDL you'd have to make this three lines).

I am not sure whether this is the correct loss function, though. You want to predict who might own the field in the next step, right? That sounds like a 3-class problem. Not knowing anything about Go, I would rather formulate this as a 3-way Softmax in each cell. However, CNTK cannot (yet) do that, Softmax always operates over the entire tensor (you could compute the softmax manually, but that would be involved). (Note that if it was a 2-class problem, they would be identical, as the gradient of Softmax and square error turn out to be the same, if I am not entirely mistaken.)

"Chain pooling" as you describe it could be done at a cost. E.g. a location may be surrounded by 0..8 neighbors. You could create 9 separate feature maps, somehow create a secondary map that counts the number of neighbors as a one-hot representation (need to think how, but sounds doable), and element-wise multiply with it. That multiply will gate the gradient to flow only into the filter with the matching number. Note that this will do much more computing that needed, as it would throw away (multiply with zero) 8 of 9 intermediate computation results.

You will not have a problem of convergence, but one of over-training. I would even now expect that you do, 450 samples seems very small, even if you increase it 16 times. Probably less so for the convolution filters, but more so for the dense layers. Any chance you can get more data? MUCH more?

jsboige commented 8 years ago

As for rotations, you probably also want mirroring? How about inversion (Black <-> White)? You'd be increasing your training data 16-fold.

That's right, so indeed we're talking about the 8 symmetries of the square + color inversion = 16 fold increase. For color inversion, technically it's not really the same position depending on the next player's turn and I suppose I should account for that, though I'm not sure about where to plug the corresponding feature.

Now that I think about it, technically you could describe rotation, mirroring, and inversion without extending your data. The idea would be that rotating or mirroring your data is equivalent to rotating or mirroring your kernels, if you connect the layers correctly. Rotated or mirrored kernels (or any permutation) can be created through matrix products on flattened tensors that can be back-propagated through. However, one reason why extending your data is a better solution is that it gives you better randomization (since the 16 variants will not always show up in the same minibatch).

That's what I was thinking of, and I guess it could be a future option to explore, but as per your last remark, I will stick with generating more samples and getting more familiar with manipulating NDLs beforehand.

BTW, I've acquired the dataset of professional games mentioned in the original article, with about 53000 records, so the 450 records were just to get started and get the error to move.

Any chance you can get more data? MUCH more?

My record selection (only those with more than a minimum total number of turns) is a little stricter than the original author's (no resignation games) but I'll definitely generate a lot more samples once the setup is correct. Also, as I understand the author did, on top of the 16 times increase from symmetries, I could also generate feature vectors from different mid game positions for the same game, with the same corresponding end-game territories labels.

I am not sure whether this is the correct loss function, though. You want to predict who might own the field in the next step, right?

We want to predict who owns the cells by the end of the game, that is either has an unsurroundable stone at the cell's position, or surrounds an empty cell's region with an unsurroundable chain of stones. Positions are rather static in go, unlike in chess; stones don't move and gradually divide up the board space into territories, so at an early stage, evaluating territories would look like one of the pictures in the original project.

That sounds like a 3-class problem.

I also would think a { -1 0 1} 3 class representation seems right, so I went for it, but the author instead went for a 2 class 0 = white, 1 = black representation.

However, CNTK cannot (yet) do that, Softmax always operates over the entire tensor (you could compute the softmax manually, but that would be involved). (Note that if it was a 2-class problem, they would be identical, as the gradient of Softmax and square error turn out to be the same, if I am not entirely mistaken.)

If I understand correctly, then I should go for the same 2-class {0,1} representation as the author's. Now I'm not sure how I should use the SquareError function. I tried

se = SquareError(labels, ol)
CriterionNodes = (se)

but it does not seem to be that, since my Error keeps at 1.0

"Chain pooling" as you describe it could be done at a cost. E.g. a location may be surrounded by 0..8 neighbors. You could create 9 separate feature maps, somehow create a secondary map that counts the number of neighbors as a one-hot representation (need to think how, but sounds doable), and element-wise multiply with it. That multiply will gate the gradient to flow only into the filter with the matching number. Note that this will do much more computing that needed, as it would throw away (multiply with zero) 8 of 9 intermediate computation results.

The author did actually divide the input into planes of stones with different "liberties" (empty neighbors) counts, regardless of chains. As for chains, as an early attempt, I suppose I could similarly start with adding feature planes with precomputed chains, discriminated by chain size, just like the author did with liberties. I understand from your remark that just like with symmetries, I could also get CNTK to actually enforce chain invariance directly through gating computation, which sounds nice.

It's probably overkill for now, and I'll need the appropriate criterion anyway before I can think of implementing those refinements.

Anyway, a big thanks again for getting though all of this. I hope I can contribute back interesting findings.

frankseide commented 8 years ago

Fascinating. Does the two-class representation maybe make more sense, since in the end, every field is owned by one of the two sides?

In that case, it sounds like you should use a Sigmoid per field, 0/1 labels, and the Logistic criterion:

// Logistic (labels, prediction, weight)
// calculates: -sum(left * log(right) + (1-left)*log(1-right)) (optionally * weight)

which is really the same thing as Softmax for a 2-class problem. There is a chance that the gradient just works out to square error.

What error stays at 1, training or testing error? How is the training loss progressing, getting better (then have patience), worse (then lower the LR), or staying put (then increase the LR)?

As a sanity check, can you try to replace SquareError (ol, labels) by

diff = Minus (ol, labels)
sqr = ElementTimes (diff, diff)
se = SumElements (sqr)

frankseide commented 8 years ago

Oh, and how do you count test errors? ErrorPrediction is not the right node. It matches CE and will select the one highest-scoring element of the frame, and compare whether the corresponding label is 1. You need to perform classification. In the simplest case, you'd threshold your values at 0.5 (or at a different place if you have priors of some form), XOR it with your ground truth, and sum up all elements. CNTK does not really have a thresholding operator currently, but we could wing something by misusing ReLU.

I would, however, expect this error to necessarily remain high, especially for regions with no stones on them.

Another question: Do you even need to predict fields that are already decided? It may make sense to mask those out from the objective, so you don't waste model parameters on the easy problem of outputting its input. E.g. do an elementwise mul of your prediction with 1-input^2.

jsboige commented 8 years ago

Does the two-class representation maybe make more sense, since in the end, every field is owned by one of the two sides?

There are usually a small proportion of empty cells between colored territories by the end of a game, which are left unplayed, because actual stones don't count as victory territories, unlike surrounded empty spaces and captured stones.

Accordingly, extending frontiers without surrounding anymore empty spaces does not bring any points/advantage and playing in your own territories actually loses points to the opponent, so the game usually ends by mutual agreement when there are only losing or point-less frontier moves left to be played, or a bit earlier when there is a known number of neutralizing points-gaining moves to be exchanged, such that the final score is decided.

However, the 2 class interpretation probably ties up pretty well with a mid interval probability distribution, thus the authors' success with it. I will look into normalizing my inputs within the [0;1] interval as the author did.

What error stays at 1, training or testing error? How is the training loss progressing, getting better (then have patience), worse (then lower the LR), or staying put (then increase the LR)?

As a sanity check, can you try to replace SquareError (ol, labels) by

diff = Minus (ol, labels)
sqr = ElementTimes (diff, diff)
se = SumElements (sqr)

Here is my last setup (but as you pointed, at least err is wrong):

(...)
h1Dim = 361
       h1 = DNNImageSigmoidLayer(11, 11, cMap2, h1Dim, conv2_act, 1)
       ol = DNNLayer(h1Dim, 361, h1, 1)
       diff = Minus (ol, labels)
       sqr = ElementTimes (diff, diff)

       ce = SumElements (sqr)
       err = ErrorPrediction(labels, ol)

        # Special Nodes
        FeatureNodes = (features)
        LabelNodes = (labels)
       CriterionNodes = (ce)
       EvalNodes = (err)
       OutputNodes = (ol)

and now here's what I get for epoch 1 (note I cropped training data to 450 records):

Starting Epoch 1: learning rate per sample = 0.002000  effective momentum = 0.000000  momentum as time constant = 0.0 samples
UCIFastReader: Starting at epoch 0, counting lines to determine record count...
 450 records found.
starting epoch 0 at record count 0, and file position 0
already there from last epoch

Starting minibatch loop.
RandomOrdering: 75 retries for 450 elements (16.7%) to ensure window condition
RandomOrdering: recached sequence for seed 0: 37, 27, ...
Finished Epoch[ 1 of 5]: [Training Set] TrainLossPerSample = 1.#INF; TotalSamplesSeen = 450; EvalErrPerSample = 1; AvgLearningRatePerSample = 0.0020000001; EpochTime=26.9781
SGD: Saving checkpoint model '../Output/Models/02_Convolution.1'

Training then proceeds with a crash in epoch 2:

Starting Epoch 2: learning rate per sample = 0.002000  effective momentum = 0.000000  momentum as time constant = 0.0 samples
starting epoch 1 at record count 450, and file position 0
already there from last epoch

Starting minibatch loop.
RandomOrdering: 111 retries for 450 elements (24.7%) to ensure window condition
RandomOrdering: recached sequence for seed 1: 214, 40, ...
HasNan: NaN detected at TrainOneEpoch/UpdateWeights():  (0,0) in (361,1) matrix

About to throw exception 'ol.b LearnableParameter operation has NaNs in functionValues after parameter update.'

EXCEPTION occurred: ol.b LearnableParameter operation has NaNs in functionValues after parameter update.

[CALL STACK]
    > Microsoft::MSR::CNTK::SGD<float>::  TrainOneEpoch
    - Microsoft::MSR::CNTK::SGD<float>::  TrainOrAdaptModel
    - Microsoft::MSR::CNTK::SGD<float>::  Train
    - DoTrain<Microsoft::MSR::CNTK::ConfigParameters,float>
    - DoCommands<float>
    - wmainOldCNTKConfig
    - wmain1
    - wmain
    - __tmainCRTStartup
    - wmainCRTStartup
    - BaseThreadInitThunk
    - RtlUserThreadStart

You need to perform classification. In the simplest case, you'd threshold your values at 0.5 (or at a different place if you have priors of some form), XOR it with your ground truth, and sum up all elements.

Alright, I will look into implementing the low level operator to perform cell-wise classification for ErrorPrediction.

CNTK does not really have a thresholding operator currently, but we could wing something by misusing ReLU.

I'm not sure what you mean here, as there are Sigmoid and Tanh. Do you mean I won't be able to center my sigmoid on the [0;1] interval, which ReLu actually does? I think I have read it's important to introduce a non linearity, which ReLU doesn't on that interval, but I must have misunderstood something.

I would, however, expect this error to necessarily remain high, especially for regions with no stones on them.

That's what I would normally expect too, since there is still usually a lot of undecided in mid-game, especially in non explored regions as you correctly pointed out.

However, in the authors setup, the resulting model seemed to be pretty good at making neutral predictions in those areas, as can be seen in the corresponding images, it's even figuring out non trivial local life-or-death situation, so it seems that the induced biases from unpredictable territories cancel each other during training. A large dataset is probably important here, though.

Do you even need to predict fields that are already decided? It may make sense to mask those out from the objective, so you don't waste model parameters on the easy problem of outputting its input. E.g. do an elementwise mul of your prediction with 1-input^2.

That's a good idea, which I don't think the author implemented, though practically it's not always that simple to assess what's already decided. As a matter of fact, the lib I'm using for data set processing is pretty bad at that, for some reason, and I had to hack into the code to improve end-game territories detection, which is usually quite simple. But I suppose I could also harden mid game status evaluation with some more efforts.

But again the author has had quite surprisingly good results at actually predicting stone status, and that with a modest setup, without using much of the masks and symmetries that we've been discussing, so I would hope we can leave that kind of assessment to the model just like he did.

Even as is, his results are spectacularly close to allowing the switch away from Monte-Carlo to pure Alpha-Beta pruning with a strong heuristic function derived from the model. This is exciting !

jsboige commented 8 years ago

On that last point about status evaluation, there are actually quite a few singular situations, which could be put to good use to tweak the model.

For instance, the ladder is a simple situation, where a position at the other end of the board diagonally may influence the final status of a stone (note the change from "white disaster" to "black disaster" that a distant apparently unrelated stone makes here). The way I see it, a perfect convoluted network with layers of 3*3 "Manhattan" kernels will propagate the signal along the diagonal just like the ladder appears, before it is bounced and the status recovered on the way back.

In a way, we're looking to figure out the rules of an hypothetical multidimensional grid shaped cellular automata that completely describes the dynamics of the game (influence map, thickness map etc.), and it is no coincidence that Conway invented his game of life on a board of Go....

If such an automaton exists, which seems plausible, then I suppose that's another very strong hypothesis to reduce of learnable parameters: if we transpose that into sharing kernels between consecutive layers, each representing some kind of Kalman filter running the dynamic cellular automata from the board position.

I suppose your remarks about enforcing the 18 symmetries and chains pooling may be translated into reducing by the same amount the surface of parameters to learn. That looks like something we may want to do at some point to prune the network optimally, but I get from your remark about randomization that it may be good to let asymmetries ease convergence initially.

frankseide commented 8 years ago

The crash in the second epoch is correct, since operating on INF is likely giving you a NaN. So we need to find out why is the first epoch getting you INF? I would suggest to look at it fine-grained: Set numMBsToShowResult=1 and minibatchSize=1, and see if we see any convergence/divergence behavior, or if it is INF from the start. I have seen divergence lead to NaNs, but not to INF, so this should be something else.

Do you see the same problem with SquareError()?

Does this already use the elementwise-multiplication trick to create the Manhattan kernel?

I'm not sure what you mean here, as there are Sigmoid and Tanh.

For the actual classification, you will need to choose: 0 or 1. If the value is >0.5 then choose 1, else 0. We'd need to see how to do that.

frankseide commented 8 years ago

So for error counting, I am talking internally to see if we can implement thresholding operators. It will be 1 page of C++ code, mostly boilerplate, with only 8 or so lines doing the actual operation. The cost would mostly be writing test cases. But until this code lands, one way to at least approximate thresholding of the output of a Sigmoid is to say

(int) (a < b)  = limit_{f --> INF} Sigmoid ( (a-b) * f )

or in CNTK,

thresholded = Minus (sigmoidOutput, Constant (0.5))
decisionZ = Scale (1000000, thresholded)
decision = Sigmoid (decisionZ)

Maybe write that up as a macro that we can replace later.

To count errors against your 0/1 ground truth, you need to compute

err = SumElements (Xor (groundTruth, decision))

with

Xor (a, b) = a + b - 2 * (a .* b)

(expressed in NDL). This should be correct if we have a true thresholding, but probably also works well enough with the saturated-Sigmoid hack above.

frankseide commented 8 years ago

BTW looking at the GoCNN code again, your setup differs in the Sigmoid layers. The GoCNN model uses convolution all the way to the top, where each layer is a 2D convolution followed by ReLU except for the last one, which uses a Sigmoid instead. However, if I read the TF source code right, that last Sigmoid is not a dense layer; the Sigmoid is applied independently on a per-cell basis, where, due to the convolution, the same set of parameters is applied to every cell position.

In comparison, the CNTK DNNImageSigmoidLayer() is a dense layer that simply flattens the input into a W_H_C-size vector, and then applies a regular Sigmoid (W * input + b) to it. I believe that is not what you want. You want to keep the cell positions independent.

You can write the GoCNN model structure in NDL exactly as in TF.

Note that compared to a dense layer, this full convolution effectively increases your training data for each convolution kernel 361-fold.

jsboige commented 8 years ago

I think I managed to move on a bit with the beginning of convergence after reviewing the different points you made, and many trials and crashes. Here is what I ended up with:

I have regenerated training (2500 records) and test (292) datasets with values 0 - white / 0.5 - empty / empty / 1 - black
Training criterion was stabilized at

ce = SquareError(labels, ol)

after I could verify per your advice that it was equivalent to

diff = Minus (ol, labels)
    sqr = ElementTimes (diff, diff)
    ce = SumElements (sqr)

error count took a while not to not crash, I'm not sure yet if it's consistant:

In Macros.ndl, I introduced your approximated threshold, and the Xor functions as follows:

SaturatedMidUnitThreshold(sigmoidOutput) = [
    thresholded = Minus (sigmoidOutput, Constant (0.5))
    decisionZ = Scale (Const(1000000), thresholded)
    decision = Sigmoid (decisionZ)
]

XorOp (a, b) = [
    mySum = Plus(a , b)
    myDotProduct =  ElementTimes(a , b)
    myScaledDotProduct = Scale(Const(2), myDotProduct)
    result = Minus(mySum , myScaledDotProduct)
]

and then in the model ndl, I used:

ol2 = SaturatedMidUnitThreshold(ol) err = SumElements (XorOp (labels, ol2))

Before I could get that right, I had noticed that err = SquareError(labels, ol) would seem to work just as well with much faster computation.

For convolutions, I adopted a fixed kernel size of 3*3 with 16 channels, and updated Macros to enable zero padding in order to keep the image size 19, although I haven't considered adding the Board plane yet. In my last setup, I have 7 ConvReLU layers
About Manhattan Kernels: After cntk complained successively that the pattern should be entered as a flat vector, and that the element product between the kernel parameters and the mask implied the mask should have the same nb of channels I ended up with the mask file having:

0 1 0 1 1 1 0 1 0
0 1 0 1 1 1 0 1 0
0 1 0 1 1 1 0 1 0
(... 16 rows ...)
0 1 0 1 1 1 0 1 0
0 1 0 1 1 1 0 1 0

Also, I initially tried

ManhattanMask = ImageParameter( 3, 3, 16, init="fromFile", initFromFilePath="ManhattanKernel.txt", learningRateMultiplier=0, imageLayout=$imageLayout$)

but I had dimensions mismatched, so I ended up with

ManhattanMask = LearnableParameter( 16, 9, init="fromFile", initFromFilePath="ManhattanKernel.txt", learningRateMultiplier=0)

in Macros.ndl, I introduced:


MaskedConvReLULayer(mask, inp, outMap, inWCount, kW, kH, hStride, vStride, wScale, bValue) = [
    convW = LearnableParameter(outMap, inWCount, init="uniform", initValueScale=wScale)
    masked = ElementTimes (convW, mask)
    convB = ImageParameter(1, 1, outMap, init="fixedValue", value=bValue, imageLayout=$imageLayout$)
    conv = Convolution(masked, inp, kW, kH, outMap, hStride, vStride, zeroPadding=true, imageLayout=$imageLayout$)
    convPlusB = Plus(conv, convB);
    act = RectifiedLinear(convPlusB);
]

and I could still train (no visible impact at first sight) after replacing the first convolution with

conv1_act = MaskedConvReLULayer(ManhattanMask, features, cMap1, 9, KernelSize, KernelSize, hStride1, vStride1, 10, 1)

Trying to apply the the mask to the subsequent layers

#conv2_act = ConvReLULayer(conv1_act, cMap2, 144, KernelSize, KernelSize, hStride2, vStride2, 10, 1)
    conv2_act = MaskedConvReLULayer(ManhattanMask, conv1_act, cMap2, 144, KernelSize, KernelSize, hStride2, vStride2, 10, 1)

I had a dimension mismatch:

Input dimensions [16 x 144 {1,16}] and [16 x 9 {1,16}] are not compatible.

That makes me think I have the dimensions wrong inverted somehow. Should I transpose my mask file?

Now about the model convergence: as per your advice i applied the following:

minibatchSize = 1
        numMBsToShowResult = 1

Depending on the end layers, with

h1 = DNNImageSigmoidLayer(19, 19, 16, BoardCellNb, conv7_act, 1) or h1 = DNNSigmoidLayer(5776, BoardCellNb, conv7_act, 1) (is that equivalent?) , and with or without an additional (seems optional)

ol = DNNLayer(BoardCellNb, BoardCellNb, h1, 1)

I could observe convergence from Epoch[ 1 of 10]-Minibatch[ 1- 1, 0.04%]: SamplesSeen = 1; TrainLossPerSample = 133.12721252; EvalErr[0]PerSample = 81169.00000000; TotalTime = 1.9252s; SamplesPerSecond = 0.5

Epoch[ 1 of 10]-Minibatch[ 10- 10, 0.40%]: SamplesSeen = 1; TrainLossPerSample = 54.26940918; EvalErr[0]PerSample = 66546.50000000; TotalTime = 0.6275s; SamplesPerSecond = 1.6

that is, about 10 samples into training, and then a not much progresses with a stabilization around TrainLoss = 45 and EvalErr still at 66000. Now I'm not sure if there's still something important to change, but I will try to train for longer to see what I can get out of it, but as we discussed, it is expected that the unpredictable end-territories put a had limit of the capabilities to learn.

Finally, as per your final remark I tried to stick to the authors' end layer, and replace the ImageSigmoid or Sigmoid + DNNLayer with a non squashed convolution with 1 channel, followed by a sigmoid:

h1 =  ConvLayer(conv7_act, 1, 144, KernelSize, KernelSize, hStride2, vStride2, 10, 1)
ol = Sigmoid(h1)

where ConvLayer is defined by truncating ConvReLULayer to:

ConvLayer(inp, outMap, inWCount, kW, kH, hStride, vStride, wScale, bValue) = [
    convW = LearnableParameter(outMap, inWCount, init="uniform", initValueScale=wScale)
    convB = ImageParameter(1, 1, outMap, init="fixedValue", value=bValue, imageLayout=$imageLayout$)
    conv = Convolution(convW, inp, kW, kH, outMap, hStride, vStride, zeroPadding=true, imageLayout=$imageLayout$)
    convPlusB = Plus(conv, convB);
]

I did not manage to observe any convergence here, so there must be something important differing from the original setup.

To summarize, that's yet another significant step, but there are still many things I'm not sure are correct.

frankseide commented 8 years ago

Few things for now: You are using imageLayout=cudnn, right?

And could you paste the Validation output for the dimension error? It should be possible to do an ElementTimes of a [19 x 19 x 16] and a [19 x 19], a x1 will automatically be appended, and then it will be broadcast. If not, it's a bug.

frankseide commented 8 years ago

A minor thing: The error rate you will be measuring will be 361 x too large, since each frame is counted as 1 sample, but the error count ranges 0..361.

frankseide commented 8 years ago

Should I transpose my mask file?

What happens if you declare the LearnableParameter to be of dimension 3 x 3 and pass it

0 1 0 1 1 1 0 1 0

as the input file? Could you paste the Validation output for this learnable parameter and any output that fails to validate (or the error message)?

frankseide commented 8 years ago

I just noticed that the saturation may not work. If you build yourself, can you change that #if 1 below to #if 0? Or alternatively single-step through it with a debugger and set z to 1000000 manually and see whether exp(1000000) == INF, and INF+1 == INF and 1/INF == 0. I never checked actually.

template <class ElemType>
DECL ElemType Sigmoid(ElemType z)
{
#if 1 // BUGBUG: Numerically bad. But if I don't use this, results change.
    ElemType negElem = -z;
    ElemType e = exp_(negElem);

    return 1 / (e + 1);
#else

jsboige commented 8 years ago

I have tried a couple setups to assess performances over the test dataset, the result after 1 epoch were (epoch 2 & following bring no improvment)

with ol = DNNImageSigmoidLayer + Manhattan mask on 1st ConvReLu:

err: SumElements/Sample = 64148.292 ce: SquareError/Sample = 63.657482

with ol=DNNImageSigmoidLayer + DNNLayer on 1st ConvReLu:

err:SumElements/Sample = 64148.292 ce: SquareError/Sample = 63.657482

with ol=DNNImageSigmoidLayer , 1 convReLu instead of 7, 100 samples instead of 2500

err: SumElements/Sample = 62970.45 ce: SquareError/Sample = 45.178883

and then I tend to have squareError (eval) around squareError(training), that is around 45 most of the time and err around 63000 with similar modest setups, training set and quick convergence.

It seems training does not work much then:

Since the best way to assess how much the network learnt is probably to plot the same territory images as in the Author article. I will move into that direction. Also, I will need to review the way I produce datasets now, that is to account for the 16 symmetries, and different position times within the game. One of the main remarks in the author is his model yields better result with end-game positions, which is natural.

Now about your questions:

You are using imageLayout=cudnn, right?

I wasn't able to try the model succesfully on GPU: As per one of the earlier messages, the release binary fails when reading the training set with:

ImageParameter should have 3 parameters [width, height, numChannels]

And then I was only able to build the CPU-only version. I will look if I can get MathCuda to build by reinstalling Cuda 7 SDK, but so far I have used:

deviceId = -1
imageLayout = "legacy"

I understand that imageLayout should have to do with placing the channels at the start or the end of 3 dim vector indexing, but as this was introduced as "the switch" from GPU to CPU without more precisions about updating datasets, I did not pay too much attention. Let me know if it has a consequence on how to prepare data.

And could you paste the Validation output for the dimension error?

I've saved ManhattanKernelSimple.txt to your simple 3*3 mask, and then with

ManhattanMask = ImageParameter( 3, 3, 1, init="fromFile", initFromFilePath="ManhattanKernelSimple.txt", learningRateMultiplier=0, imageLayout=$imageLayout$)

since I'm CPU only with $imageLayout$=Legacy, I get the following exception

EXCEPTION occurred: VerifySize: expected matrix size 1 x 9, but it is 3 x 3, for ManhattanMask LearnableParameter operation.

Now, forcing

ManhattanMask = ImageParameter( 3, 3, 1, init="fromFile", initFromFilePath="ManhattanKernelSimple.txt", learningRateMultiplier=0, imageLayout=cudnn)

, with

conv1_act = MaskedConvReLULayer(ManhattanMask, features, cMap, 9, KernelSize, KernelSize, hStride, vStride, 10, 1)

where cMap = 16 , I get the following error: Validating --> conv1_act.masked = ElementTimes (conv1_act.convW, ManhattanMask) : [16 x 9 {1,16}], [3 x 3 x 1 {1,3,9}] -> [16 x 9 x 1 {1,16,144}] FAILED

Now switching the file to a single row 0 1 0 1 1 1 0 1 0 with imageLayout=$imageLayout$ I get:

EXCEPTION occurred: conv1_act.masked ElementTimes operation: Input dimensions [16 x 9 {1,16}] and [1 x 3 x 3 {1,1,3}] are not compatible.

and then with 16 rows and ManhattanMask = ImageParameter( 3, 3, 16, init="fromFile", initFromFilePath="ManhattanKernel.txt", learningRateMultiplier=0, imageLayout=$imageLayout$)

Exception becomes

EXCEPTION occurred: conv1_act.masked ElementTimes operation: Input dimensions [16 x 9 {1,16}] and [16 x 3 x 3 {1,16,48}] are not compatible

So that's how I ended up with ManhattanMask = LearnableParameter( 16, 9, init="fromFile", initFromFilePath="ManhattanKernel.txt", learningRateMultiplier=0)

which did sort out the dimension exception but only for the 1st convolution, so I guess there is something wrong here.

But maybe Manhattan kernels are a refinement for later, since the author had good results without them. After the model starts learning a bit, Marking the Board's edge is probably more important at this point, so I should look at adding that feature. Could I introduce a constant matrix for that?

Finally I looked at the Sigmoid function as per your remark:

can you change that #if 1 below to #if 0?

That did not seem to make any difference that I saw

set z to 1000000 manually and see whether exp(1000000) == INF, and INF+1 == INF and 1/INF == 0.

Setting z to -1000000 (there's another negation here) I get:

e = 1.#INF0000  
e+1 = 1.#INF0000  
1/e = 0.000000000

jsboige commented 8 years ago

A minor thing: The error rate you will be measuring will be 361 x too large, since each frame is counted as 1 sample, but the error count ranges 0..361

I'm still struggling a bit to figure out whether what I have is a start, and I need to refine data and networks, or I got the setup wrong, and I should fix that first before moving anywhere.

frankseide commented 8 years ago

Thanks! So a few things we should work through.

I would suggest to get GPU and cudnn to work. The legacy format is just that, legacy. But as a start, would you mind posting the entire log, so that I can verify the dimensions for you?

Secondly, instead of ManhattanMask = ImageParameter( 3, 3, 1, init="fromFile", ..., could you try an alternative where you don't specify that last '1', but instead please use a regular Parameter(3, 3, init="fromFile", .... That will give you an actual 2D matrix. Then it should accept a 2D input file (3 lines of 3 numbers each), and broadcasting will still interpret it as a 3 x 3 x 16.

But the big problem in your setup is that the DNNImageSigmoidLayer is not what you want. That layer does this:

reinterpret an image as a flat vector
apply a regular Sigmoid layer to that flat vector It does not apply a Sigmoid to all pixels of a 2D image, although that would be a reasonable interpretation of the name as well.

DNNImageSigmoidLayer looses the location relationship (or would have to learn to keep its matrix somewhat sparse/diagonal-ish). I think GoCNN does the correct thing, which is to just apply Sigmoid function spatially to a 2D convolution output. That will maintain the 2D structure.

I think to succeed, we will need to get the spatial Sigmoid to work. You already tried that but did not get convergence. So I suggest to focus on getting it to converge. One thing that came to my mind is this:

your criterion is really 361 independent criteria
when back-propagating, every sample really back-propagates 361 criterion gradients into every single filter parameter
so this is really as if your minibatch size was 361 times larger. You have an effective minibatch size of 25 * 361 = 9025, which is huge. I normally start with 128. Too large minibatches lead to too large steps and divergence, especially early on during the training. This could explain why the original Softmax configuration you had did converge (it only propagated a single gradient instead of 361), while the spatial Sigmoid one does not: it just takes too many updates at once. If this conjecture is true, the correct solution would be to set minibatch size to 1 (you said you tried that, but also for this config?), or even randomly sub-sample.

But for a start, I would instead try to simply adjust the learning rate down by a factor of 361 accordingly. The learningRatesPerMB parameter would end up being around 1.4e-4. Incidentally, GoCNN uses 1e-4, but I have no idea whether that number has the same interpretation as in CNTK.

So that's what I would try next.

If that does not work, then in addition to that, I would make a test where all ReLUs are replaced by Sigmoids. ReLUs are finicky to train since they are scale-preserving and can thus lead to an arbitrary scaling-shift between layers that is not observable to the objective function. It is often a matter of luck, specifically with choosing the correct initialization) values. The recent BatchNormalization technique fixes that. So if replacing them by Sigmoids works but ReLUs don't, then we should try ReLUs with BatchNormalization.

(And finally thanks for confirming that the Sigmoid function has no saturation problem, that rules out a numeric problem.)

jsboige commented 8 years ago

Thanks a lot !

I would suggest to get GPU and cudnn to work

After I could switch to GPU, I was able to get somewhere.

You were right with the learning rate being a factor with the spatial sigmoid. I was able to get around

err: SumElements/Sample = 74.838471 ce: SquareError/Sample = 24.928771

after about 2500 training samples or 10 epochs for 500 samples. At the moment, I settled for 5.e-4, because finer learning rate or larger training set does not seem to make much of a difference, but now there's plenty to explore from the ce = 100 and err = 200 of an untrained model.

Now, all of this worked only when I reduced the network to max 2 hidden ConvReLu + 1 ol Conv / Sigmoid network. I added the ConvSigLayer function to Macro.ndl, and used it for the output layer:

ol = ConvSigLayer(conv2_act, 1, 288, KernelSize, KernelSize, hStride, vStride, 10, 1)

With 3 or more ConvReLu, convergence did hang around se = 70. Replacing ConvReLu with ConvSigLayers did as you indicated permit adding more hidden layers, and, it even permitted to trigger convergence with 3 or more ConvReLu as initial layers, although all test evaluation results did not prove better than the modest 2 ConvReLu + 1 output ConvSigLayer network on variations with several additional layers ConvReLuLayer and ConvSigLayer.

Now I feel the need to look at what the model learnt, and get a better grasp of what se and err are. Using the same GUI as the authors, I was able to use the same Go protocol (GTP) command to generate a mock influence map of the same kind that the article's images we discussed before. I will work on evaluating the model to produce the same kind of maps.

BTW, I will probably have to run the model on a CPU-only environment at some point. I noticed when I switched back to CPU-training, that the model would get back to the 50 something square error, and I earlier experimented that batch Normalization was not an option for CPU. Apart from that, will I be able to run the model on CPU?

I guess there might be some limitations to what the model can learn in my training strategy at this point, so I need to review that, and the influence maps will probably help, especially concerning the unpredictable territories at mid game positions.
Board detection is probably going to be important as well, zero padding translates more or less to white surrounds the board, which would be a different pretty nightmarish game for black
About Kernels, I tried your suggestion:

ManhattanMask = Parameter( 3, 3, init="fromFile", initFromFilePath="ManhattanKernelSimple.txt")

would you mind posting the entire log, so that I can verify the dimensions for you?

Here's the corresponding validation log


CNTKCommandTrainBegin: train
NDLBuilder Using GPU 0
Reading UCI file ../Data/Train-0-1.txt

Creating virgin network.
Microsoft::MSR::CNTK::GPUMatrix<ElemType>::SetUniformRandomValue (GPU): creating curand object with seed 1, sizeof(ElemType)==4

Post-processing network...

3 roots:
        ce = SquareError
        err = SumElements
        ol = Sigmoid
FormNestedNetwork: WARNING: Was called twice for ce SquareError operation
FormNestedNetwork: WARNING: Was called twice for err SumElements operation
FormNestedNetwork: WARNING: Was called twice for ol Sigmoid operation

Validating network. 31 nodes to process in pass 1.

Validating --> labels = InputValue() :  -> [19 x 19 x 1 {1,19,361} x *]
Validating --> h1.convW = LearnableParameter() :  -> [1 x 288 {1,1}]
Validating --> conv2_act.convW = LearnableParameter() :  -> [32 x 288 {1,32}]
Validating --> conv1_act.convW = LearnableParameter() :  -> [32 x 9 {1,32}]
Validating --> ManhattanMask = LearnableParameter() :  -> [3 x 3 {1,3}]
Validating --> conv1_act.masked = ElementTimes (conv1_act.convW, ManhattanMask) : [32 x 9 {1,32}], [3 x 3 {1,3}] -> [32 x 9 {1,32}]
Validating --> features = InputValue() :  -> [19 x 19 x 1 {1,19,361} x *]
Validating --> conv1_act.conv = Convolution (conv1_act.masked, features) : [32 x 9 {1,32}], [19 x 19 x 1 {1,19,361} x *] -> [19 x 19 x 32 {1,19,361} x *]
Validating --> conv1_act.convB = LearnableParameter() :  -> [1 x 1 x 32 {1,1,1}]
Validating --> conv1_act.convPlusB = Plus (conv1_act.conv, conv1_act.convB) : [19 x 19 x 32 {1,19,361} x *], [1 x 1 x 32 {1,1,1}] -> [19 x 19 x 32 {1,19,361} x *]
Validating --> conv1_act.act = RectifiedLinear (conv1_act.convPlusB) : [19 x 19 x 32 {1,19,361} x *] -> [19 x 19 x 32 {1,19,361} x *]
Validating --> conv2_act.conv = Convolution (conv2_act.convW, conv1_act.act) : [32 x 288 {1,32}], [19 x 19 x 32 {1,19,361} x *] -> [19 x 19 x 32 {1,19,361} x *]
Validating --> conv2_act.convB = LearnableParameter() :  -> [1 x 1 x 32 {1,1,1}]
Validating --> conv2_act.convPlusB = Plus (conv2_act.conv, conv2_act.convB) : [19 x 19 x 32 {1,19,361} x *], [1 x 1 x 32 {1,1,1}] -> [19 x 19 x 32 {1,19,361} x *]
Validating --> conv2_act.act = RectifiedLinear (conv2_act.convPlusB) : [19 x 19 x 32 {1,19,361} x *] -> [19 x 19 x 32 {1,19,361} x *]
Validating --> h1.conv = Convolution (h1.convW, conv2_act.act) : [1 x 288 {1,1}], [19 x 19 x 32 {1,19,361} x *] -> [19 x 19 x 1 {1,19,361} x *]
Validating --> h1.convB = LearnableParameter() :  -> [1 x 1 x 1 {1,1,1}]
Validating --> h1.convPlusB = Plus (h1.conv, h1.convB) : [19 x 19 x 1 {1,19,361} x *], [1 x 1 x 1 {1,1,1}] -> [19 x 19 x 1 {1,19,361} x *]
Validating --> ol = Sigmoid (h1.convPlusB) : [19 x 19 x 1 {1,19,361} x *] -> [19 x 19 x 1 {1,19,361} x *]
Validating --> ce = SquareError (labels, ol) : [19 x 19 x 1 {1,19,361} x *], [19 x 19 x 1 {1,19,361} x *] -> [1 {1}]
Validating --> ol2.unnamed122 = LearnableParameter() :  -> [1 x 1 {1,1}]
Validating --> ol2.unnamed120 = LearnableParameter() :  -> [1 x 1 {1,1}]
Validating --> ol2.thresholded = Minus (ol, ol2.unnamed120) : [19 x 19 x 1 {1,19,361} x *], [1 x 1 {1,1}] -> [19 x 19 x 1 {1,19,361} x *]
Validating --> ol2.decisionZ = ElementTimes (ol2.unnamed122, ol2.thresholded) : [1 x 1 {1,1}], [19 x 19 x 1 {1,19,361} x *] -> [19 x 19 x 1 {1,19,361} x *]
Validating --> ol2.decision = Sigmoid (ol2.decisionZ) : [19 x 19 x 1 {1,19,361} x *] -> [19 x 19 x 1 {1,19,361} x *]
Validating --> unnamed125.mySum = Plus (labels, ol2.decision) : [19 x 19 x 1 {1,19,361} x *], [19 x 19 x 1 {1,19,361} x *] -> [19 x 19 x 1 {1,19,361} x *]
Validating --> unnamed125.unnamed126 = LearnableParameter() :  -> [1 x 1 {1,1}]
Validating --> unnamed125.myDotProduct = ElementTimes (labels, ol2.decision) : [19 x 19 x 1 {1,19,361} x *], [19 x 19 x 1 {1,19,361} x *] -> [19 x 19 x 1 {1,19,361} x *]
Validating --> unnamed125.myScaledDotProduct = ElementTimes (unnamed125.unnamed126, unnamed125.myDotProduct) : [1 x 1 {1,1}], [19 x 19 x 1 {1,19,361} x *] -> [19 x 19 x 1 {1,19,361} x *]
Validating --> unnamed125.result = Minus (unnamed125.mySum, unnamed125.myScaledDotProduct) : [19 x 19 x 1 {1,19,361} x *], [19 x 19 x 1 {1,19,361} x *] -> [19 x 19 x 1 {1,19,361} x *]
Validating --> err = SumElements (unnamed125.result) : [19 x 19 x 1 {1,19,361} x *] -> [1 {1}]

Validating network. 19 nodes to process in pass 2.

Validating network, final pass.

About to throw exception 'Node 'conv1_act.masked' (ElementTimes operation): Input dimensions [32 x 9 {1,32}] and [3 x 3 {1,3}] are not compatible.'
Validating --> conv1_act.masked = ElementTimes (conv1_act.convW, ManhattanMask) : [32 x 9 {1,32}], [3 x 3 {1,3}] -> [32 x 9 {1,32}] FAILED

EXCEPTION occurred: Node 'conv1_act.masked' (ElementTimes operation): Input dimensions [32 x 9 {1,32}] and [3 x 3 {1,3}] are not compatible.

[CALL STACK]
    > Microsoft::MSR::CNTK::ComputationNodeBase::  ValidateBinaryZip
    - Microsoft::MSR::CNTK::BinaryElementWiseNode<float>::  Validate
    - Microsoft::MSR::CNTK::ComputationNetwork::  ValidateNode
    - Microsoft::MSR::CNTK::ComputationNetwork::  ValidateNodes
    - Microsoft::MSR::CNTK::ComputationNetwork::  ValidateNetwork
    - Microsoft::MSR::CNTK::ComputationNetwork::  CompileNetwork
    - Microsoft::MSR::CNTK::NDLBuilder<float>::  LoadFromConfig
    - Microsoft::MSR::CNTK::NDLBuilder<float>::  LoadNetworkFromConfig
    - Microsoft::MSR::CNTK::NDLBuilder<float>::  BuildNetworkFromDescription
    - <lambda_e42f6aaad8cf77c71d079b9b0ad0f8de>::  operator  ()
    - std::_Callable_obj<<lambda_e42f6aaad8cf77c71d079b9b0ad0f8de>,0>::_ApplyX<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNetwork>,int>
    - std::_Func_impl<std::_Callable_obj<<lambda_e42f6aaad8cf77c71d079b9b0ad0f8de>,0>,std::allocator<std::_Func_class<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNetwork>,int>>,std::shared_ptr<Microsoft::MSR::CNTK::ComputationNetwork>,int>::  _Do_call
    - std::_Func_class<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNetwork>,int>::  operator  ()
    - Microsoft::MSR::CNTK::SGD<float>::  Train
    - DoTrain<Microsoft::MSR::CNTK::ConfigParameters,float>
    - DoCommands<float>

frankseide commented 8 years ago

Excellent news! The error rate of ~75 out of 361 sounds very good. This is for the training set, right?

As for GPU/CPU, CPU should always have the same accuracy as GPU. Except that here you also have to switch imageLayout to 'legacy', right? So if the CPU mode is worse, I think it is because of different interpretation of dimensions in legacy mode.

My suggestion is to, for now, optimize your system using GPU, and take the bet on us to release a fully compatible CPU version for the "cudnn" layout. The code already exists and is currently under code review (it's in a public branch on github, I can find out the branch if this is of immediate urgency to you; but note that the final version is likely to use different NDL operation names).

As part of that update, BatchNormalization will work on the CPU as well. So I would suggest to also take a bet on BatchNormalization. I cannot predict whether ReLUs are more or less correct for this task, but I would guess they will be similarly accurate but training faster with BatchNorm (which does not help that much for Sigmoids).

As for the mask, this may be the problem: conv1_act.convW = LearnableParameter() : -> [32 x 9 {1,32}]. I would expect your kernel to be [3 x 3 x 32]. Is this for legacy layout? Again, I suggest to strictly work on GPU in cudnn mode.

jsboige commented 8 years ago

Excellent news!

Indeed ! It is so much more satisfying to have the model learn something eventually. For reference here is the influence map generated by early success with 2 convReLu + 1 ConvSig / 2500 samples after a couple minutes of training (se=24.9; err=75):

2_conv_2500_samples_5 epochs_se_24 9_err_75_chainpooling That's to compare with the influence map from the author's model influence map for the same position:

original_chainpooling

The maps may look similar, but his is a lot more precise in assessing capture status and what's at stake in the open.

The error rate of ~75 out of 361 sounds very good. This is for the training set, right?

75 was for testing. Over-training has been much of an issue yet, even for small datasets.

Also, I regenerated 20000 samples and had a longer training session with 15 ConvLayers, (4 of them ReLu, the others sigmoids) over 50 epochs. The result was just about the same as with the initial small network (se=24.3; err=74):

15_conv_20000_samples_20 epochs_se_24 3_err_74_chainpooling

From the map, it is clear the larger network is just as clueless as the trivial one, concerning captures and long range interaction.

I could also verify on a "ladder" setup that territorial integration does not click, with only very local rough computations. With 11 layers and 3*3 kernels, ideally, the network should be able to radiate out 5 cells diagonally (no Manhattan kernels yet), bounce on a "ladder-breaker"s radiation if any, and back 5 again, to assess the threatened stones capture status.

As a last confirmation that the network did not learn much, I have extracted a value function from the influence map: simply the sum of all vector coords, scaled back to [-1,1] as the author did

I have plugged that function into the search logic of an existing very poor engine. Opposing the original near random heuristic, the network exhibited a similarly poor play and ended up loosing, with a very rudimentary understanding of the situation, and no real foresight on captures.

As for GPU/CPU, CPU should always have the same accuracy as GPU. Except that here you also have to switch imageLayout to 'legacy', right? So if the CPU mode is worse, I think it is because of different interpretation of dimensions in legacy mode.

My suggestion is to, for now, optimize your system using GPU, and take the bet on us to release a fully compatible CPU version for the "cudnn" layout. The code already exists and is currently under code review (it's in a public branch on github, I can find out the branch if this is of immediate urgency to you; but note that the final version is likely to use different NDL operation names).

I had quite a few head scratches with trying to run a CPU model in the target environment, so I decided to do as you suggest and optimize on GPU first. Hopefully this is all sorted soon for CPUs.

As part of that update, BatchNormalization will work on the CPU as well. So I would suggest to also take a bet on BatchNormalization. I cannot predict whether ReLUs are more or less correct for this task, but I would guess they will be similarly accurate but training faster with BatchNorm (which does not help that much for Sigmoids).

I replaced ConvReLus with ConvBNReLULayer, and indeed I was able to get the network to converge without a limit on the number of ReLUs it seems, although the performance obtained were quite pour.

As for the mask, this may be the problem: conv1_act.convW = LearnableParameter() : -> [32 x 9 {1,32}]. I would expect your kernel to be [3 x 3 x 32]. Is this for legacy layout? Again, I suggest to strictly work on GPU in cudnn mode.

I have been using "cudnn" mode since I could work with GPU, with:

ManhattanMask = Parameter( 3, 3, init="fromFile",  initFromFilePath="ManhattanKernelSimple.txt")
features = ImageInput(BoardSize, BoardSize, 1, imageLayout=$imageLayout$)
(...)
 cMap = 32
    hStride = 1
    vStride = 1

     scValue = 1
    expAvg = 1

    convWScale = 10
    convBValue = 1
(...)
conv1_act = ConvBNReLULayer(features, cMap, 9, KernelSize, KernelSize, hStride, vStride, convWScale, convBValue, scValue, expAvg)
    #conv1_act = MaskedConvReLULayer(ManhattanMask, features, cMap, 9, KernelSize, KernelSize, hStride, vStride, convWScale, convBValue)

This works, with the unmasked version and the masked version triggers the dimension error. My training file has sample rows with 361 floats for label followed by 361 floats for feature.

I'm not sure what's wrong here.

Anyway the network hasn't learnt much for now, and I think my next steps are :

review the training set, make sure there is indeed more to learn here, with precise label maps and adding situations where the network can learn captures (there are usually quite few capture situations in mid game, so end games assessments should be part of the training set),
add feature planes for edges / liberties / chains.

frankseide commented 8 years ago

Very nice!

One other thing I would suggest is to review how your network differs from the author's. At this point, I don't expect a software bug to be the root of the problem, so CNTK and TF should both be equally able to learn this model. There may be a little detail that makes a big difference. Did you already do the border flags?

Another thing, the time-constant parameter to the BN node may be suboptimal. I am checking with my colleague.

Other than that, I would look at number of training samples and convergence-affecting parameters such as minibatch size, learning rate, parameters like momentum and AdaGrad etc.

frankseide commented 8 years ago

A CPU implementation of "cudnn" layout is available now in latest master source code, including Convolution, Pooling, and BatchNorm. Cf. Issue #161.

Note that the CPU code is not very performant, so for training better keep using the GPU if you can. But this gives you the option if switching any time, e.g. between a development machine and a server farm for really running it, expecting the same results.

jsboige commented 8 years ago

Thanks again for all your advises and walking me through this. I had to get off this for some time, but I'll make sure to match the author's setup and then experiment some more, especially with BNs.

I suppose also I should learn about the other model-loading scripting language at some point as this probably needs incremental learning.

Having the CPU work, is nice, since I'm planning to run the model as a system agnostic web API service. But being able to train on a GPU is nice. Do you think reintroducing non single mini batches could make a better use of the GPUs capabilities?

Also, I've got 2 GPUs on my environment, should I look into distributing training on the 2 nodes?

Anyway, it's not that I lack computation time at that stage, so I guess that's not a big issue.

Cheers

frankseide commented 8 years ago

Cool!

Using minibatches > 1 should help, but will be limited by convergence properties (not by software). I would tune that manually. I.e. start the training with 1, 2, 4, etc. and see which ones converge, and which don't. You will see that quite early in the training. Then choose the largest that does not seem to hurt. Then, after one data pass, try larger ones again, and repeat until you get a feel for it. For later epochs maybe no need to do this every epoch. But generally everywhere you cut the learning rate, you want to increase the minibatch size (increasing it by the same factor that you cut the learning rate with should be safe). This sounds like a lot of work, but you probably do this only once. Later experiments with model variations can likely just use the same MBSize profile, or at worst, a constant factor of it.

If you have 2 GPUs, you can use them. 1-bit SGD should only be enabled after the model is out of the worst initialization. In speech, I switch it on after training 24h hours (where our total corpus is maybe 4000h). So even if during this initial warm-start period I don't get parallelization gains, for later epochs you can get them. As an alternative, you can use 2 GPUs during the warm start, but disable 1-bit quantization. For non-convolutional networks that gives you a bandwidth bottleneck, but convolutional nets use parameters more often and thus may have less of a bandwidth problem. So that's also one thing to try.

However, your case has another complication. For a minibatch size of 1, you cannot really split that minibatch onto two machines. Parallelization works by distributing entire samples, so you need at least 2. So if your training works with MBsize 2 from the start, you can try parallelization on 2 GPUs with gradientBits=32, and later try gradientBits=1 & see if it makes things faster for you or not.

jsboige commented 8 years ago

Thanks again for your explanation of how to handle mini batches and multi GPUs, I'll follow your recommendations. Since I will have to get off this for another month or so, I just wanted to give a little update of where I'm going.

I have refactored the code that builds the training set to use a stronger "game-finishing" program, and I checked that the resulting samples are now at least as clean as the author's. A modest training suggests that the model indeed learns a little better, and as you suggested, there should not be any major hurdle preventing the reproduction of the original results.

Now to give a broader picture, the whole setup served as a workshop for an AI course I have been giving in French and in a .Net environment.

The course covers the infamous AIMA text book, the Java code of which I ported with IKVM, and it is centered on a multipurpose ASP.Net agent platform that I have been developing, which runs on the DNN content management system, and the latest release of which includes that Go setup we went through.

As a matter of fact, I mentioned our conversation in the release notes, I hope you don't mind, and the objective of running the models from the plateform's agents in varying hosted environment is the reason CPU support is important for me.

I'm planning to move ahead with the course and the platform, with extensive use of CNTK and Infer.Net, and I submitted a session for the incoming DNN international event, in an effort to introduce that open source community to AI techniques.

I'll keep you informed of my progresses, and let me know if you want more details about those.

Cheers

navidnadery commented 8 years ago

I Can't find branch dongyu/UCIFastReaderFix. Because i have the same problem. I get this error : "required parameter missing: 02_Convolution.cntk:train:reader:labels:labelMappingFile" I have 127 dim for input vector and 26 dim for output vector. Please any body helps me?

dongyu888 commented 8 years ago

That branch has been deleted since the change has been merged into master long time ago. So please use the current master instead.

navidnadery commented 8 years ago

Thanks @dongyu888 for your help. Just another question, i want to apply the convolution network with the dimensions i mentioned in my previous comment, so if i build and run current master , my problem is solved? because now when i use CNTKTextFormatReader i run to this error: "NotifyFunctionValuesMBSizeModified: features InputValue operation had its row dimension 784 changed by the reader to 127." and when i use UCIFastReader i run to labelMappingFile error.

Here is my files #721 , i have changed the config file and ndl files according to my dimension.