microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

Regression Example #699

Closed: cprschmid closed this issue 8 years ago

cprschmid commented 8 years ago

I have read the documentation looking for an example of how to set up (reader, network configuration) a regression problem (e.g., predicting housing prices from various inputs). However, the (reader) examples describe only classification or sequence-learning tasks.

Is there any information anywhere that I could use to get started with a regression task?

My two main questions are:

  1. How do I specify the regression (output) target value using the CNTKTextFormatReader? (In the deprecated UCI Fast Reader I could specify the value on the same line; see the sketch below for what I mean.)
  2. What is the recommended output-layer configuration? Softmax works great for classification but not for regression (the same goes for Sigmoid).
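
For concreteness, the sketch referenced in question 1: a hypothetical input line with the real-valued target on the same line as the features, the way the UCI Fast Reader allowed (the aliases F and L are made up):

|F 2.5 1.0 |L 42.0
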
mfuntowicz commented 8 years ago

Hi,

You can find an example of a regression task in the Wiki. All your questions are addressed there :)

Let us know if you need more assistance. Hope it helps, Morgan

cprschmid commented 8 years ago

I didn't mean logistic regression as the model algorithm, but rather a regression task where the goal is to predict a value (e.g., housing prices, stock prices, etc.) rather than to classify the input or learn sequences.

The example you pointed me to says "... Because we are performing binary classification, we could set this up either as a multi-class classification problem ..."

mfuntowicz commented 8 years ago

You just need to change the output layer to a linear one:

p = w * features + b

change the objective function so that it minimizes SquareError, e.g. err = SquareError (labels, p), and point the criterion at it:

criterionNodes = (err)

and remove the now-unnecessary line:

lr = Logistic (labels, p)

It'll output real values, and you'll train your network to minimize the squared error between the network output and the target value :)
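
Putting the pieces together, a minimal sketch of the modified network (assumptions: names and dimensions follow the Wiki example; the surrounding config blocks are omitted):

features = Input (featDim)
labels   = Input (labelDim)
w = Parameter (labelDim, featDim)
b = Parameter (labelDim, 1)
p = Plus (Times (w, features), b)   # linear output: no Sigmoid or Softmax on top
err = SquareError (labels, p)       # regression objective
criterionNodes = (err)
outputNodes    = (p)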

Morgan

cprschmid commented 8 years ago

I am trying to get a simple example working (simple because the training and the test data are exactly the same). Furthermore, the relationship between the inputs and the labels is strictly linear, so it should be easy to learn:

|F 1.0 1 |L 10
|F 2.0 1 |L 20
|F 3.0 1 |L 30
|F 4.0 1 |L 40

I am using the following network definition (it does mean and variance normalization explicitly, though somewhere I read that CNTK will do that automatically?):


# macros to include
load = ndlDLTMacros

# the actual NDL that defines the network
run = DNN

ndlDLTMacros = [
    featDim = 2
    labelDim = 1

    features = Input(featDim)
    labels = Input(labelDim)

    # input precompute
    featMean = Mean(features)
    featInvStd = InvStdDev(features)
    featInput = PerDimMeanVarNormalization(features, featMean, featInvStd) 
]

DNN = [

    # Variables
    hiddenDim = 3

    # Layer Operations
    # DNNSigmoidLayer and DNNLayer are defined in Macros.ndl
    h1 = DNNSigmoidLayer(featDim, hiddenDim, featInput, 1)
    ol = DNNLayer(hiddenDim, labelDim, h1, 1)

    # Criterion
    sqerr = SquareError(labels, ol)

    # Eval
    ep = ErrorPrediction(labels, ol)
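    # note: ErrorPrediction measures classification error; for a regression
    # task, tracking SquareError itself is usually the more meaningful eval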

    # Special Nodes
    FeatureNodes = (features)
    LabelNodes = (labels)
    CriterionNodes = (sqerr)
    EvalNodes = (ep)
    OutputNodes = (ol)
]

The macros are defined as such:


DNNSigmoidLayer(inDim, outDim, x, parmScale) = [
        # Parameters
    W = Parameter(outDim, inDim, init="uniform", initValueScale=parmScale) 
    b = Parameter(outDim, 1,     init="uniform", initValueScale=parmScale) 
    # Functions
    t = Times(W, x)
    z = Plus(t, b)
    y = Sigmoid(z)
]

DNNReLULayer(inDim, outDim, x, parmScale) = [
        # Parameters
    W = Parameter(outDim, inDim, init="uniform", initValueScale=parmScale) 
    b = Parameter(outDim, 1,     init="uniform", initValueScale=parmScale) 
    # Functions
    t = Times(W, x)
    z = Plus(t, b)
    y = RectifiedLinear(z)
]

DNNLayer(inDim, outDim, x, parmScale) = [
        # Parameters
    W = Parameter(outDim, inDim, init="uniform", initValueScale=parmScale)
    b = Parameter(outDim, 1,     init="uniform", initValueScale=parmScale)
    # Functions    
    t = Times(W, x)
    z = Plus(t, b)
]

And finally the configuration file is as follows:


# Copyright (c) Microsoft. All rights reserved.
# Licensed under the MIT license. See LICENSE file in the project root for full license information.

# currentDirectory=$(SolutionDir)/<path to corresponding data folder> 
RootDir = ".."

ConfigDir = "$RootDir$/Config"
DataDir   = "$RootDir$/Data"
OutputDir = "$RootDir$/Output"
ModelDir  = "$OutputDir$/Models"

# which commands to run
command=Train:Test:Output:dumpNodeInfo

#required...
precision = "float"
modelPath="$OutputDir$/Models/simple.dnn"   # where to write the model to
ndlMacros="$ConfigDir$/Macros.ndl"

# uncomment the following line to write logs to a file 
stderr = "$OutputDir$/simple_out"
traceLevel=1

deviceId=-1                 # CPU < 0
inputDimension=2        # input data dimensions
labelDimension=1        # label dimensions

#######################################
#  TRAINING CONFIG                    #
#######################################
Train=[
    action="train"

    NDLNetworkBuilder=[
        networkDescription = "$ConfigDir$/simple.ndl"
  ]

    SGD = [ 
        epochSize=0                         # =0 means size of the training set
        minibatchSize=100
        learningRatesPerMB=0.1            # learning rates per MB
        momentumPerMB = 0           
        maxEpochs=10
    ]

    # parameter values for the reader
    reader = [
        readerType = "CNTKTextFormatReader"
        file = "Train-Simple.txt"
        randomize = "none"
        maxErrors = 100
        traceLevel = 2

        input = [
            features = [
                alias = "F"
                dim = 2
                format = "dense"
            ]       
            labels = [
                alias = "L"
                dim = 1
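                # dense 1-dim value: the real-valued regression target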
                format = "dense"
            ]
        ]   
    ]   
]

#######################################
#  TEST CONFIG                        #
#######################################
Test=[

    action="test"

    reader = [
        readerType = "CNTKTextFormatReader"
        file = "Test-Simple.txt"
        #skipSequenceIds = "true"               
        randomize = "none"
        maxErrors = 100
        traceLevel = 2

        input = [
            features = [
                alias = "F"
                dim = 2
                format = "dense"
            ]       
            labels = [
                alias = "L"
                dim = 1
                format = "dense"
            ]
        ]   
    ]   
]

# output the results
Output=[

    action="write"

    reader = [
        readerType = "CNTKTextFormatReader"
        file = "Test-Simple.txt"
        randomize = "none"
        maxErrors = 100
        traceLevel = 2

        input = [
            features = [
                alias = "F"
                dim = 2
                format = "dense"
            ]       
            labels = [
                alias = "L"
                dim = 1
                format = "dense"
            ]
        ]   
    ]

    outputPath="$OutputDir$/simple.output.txt"
]

dumpNodeInfo=[
  action = dumpnode
  printValues = true
  printMetadata = true
]

The first thing I notice when I train is that the sqerr is still 600 after 10 epochs. Furthermore, the predictions (in the ol.z file) are as follows:

3.338846
3.413505
3.487528
3.560431

which is not even close to the expected sequence (10, 20, 30, 40).

There might be a mismatch between the pre-processing of the data for the train vs. test commands, but I am not sure.

Is the explicit normalization necessary at all? Or should it instead be done as part of generating the training and test sets?

cmarschner commented 8 years ago

It seems to be doing the right thing, but your learning rate is too small. Try a higher learning rate, or set learningRatePerSample instead.
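
For example (a sketch with illustrative values only, not tuned for this data set):

SGD = [
    epochSize = 0
    minibatchSize = 4              # the toy training set has only 4 samples
    learningRatePerSample = 0.01   # a per-sample rate instead of learningRatesPerMB
    maxEpochs = 200
]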

cprschmid commented 8 years ago

Indeed - changing the learning rate and increasing the number of epochs eventually produced a model that successfully learned the linear relationship.

cprschmid commented 8 years ago

I moved on to a more realistic regression task: learning to predict housing prices using the well-known UCI Housing Data Set.

I am using the same model architecture as above, adjusted for a larger input layer (13 features) and a larger hidden layer (200 nodes). However, after only 3 epochs the training aborts with sqerr = 1.#QNAN000 * 253 (i.e., NaN), the value having increased steadily with each epoch.

What's causing the sqerr to be so big to start with, and to increase steadily? The feature and label values are all < 100; even a network that always predicts 0 would have a training-set sqerr of less than a million.

frankseide commented 8 years ago

Your training is not converging. Typical causes are a too-large learning rate, too-large or too-small initialization values for the learnable parameters, and a too-large minibatch size.

frankseide commented 8 years ago

There are options in the SGD configuration block that allow you to see partial objective values as training progresses:

    numMBsToShowResult = 100  # show intermediate objective values every 100 minibatches
    firstMBsToShowResult = 10  # and for the first 10 minibatches

This way you should be able to see whether something is already off at the very start.

I recommend starting with a small minibatch size, maybe 128, and a small learningRatePerSample, maybe 0.001. Then I would try different orders of magnitude for the initialization of the weight matrices. Normally one would think they should be close to zero, with very small perturbations to break ties, but I have found that larger init values sometimes lead to better results, or at least get training off the ground earlier.
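
Applied to the config posted above, those suggestions might look like this (a sketch; the numbers are starting points to sweep, not tuned values):

SGD = [
    minibatchSize = 128
    learningRatePerSample = 0.001
    maxEpochs = 50
    numMBsToShowResult = 100    # intermediate objective values every 100 minibatches
    firstMBsToShowResult = 10   # and for the first 10 minibatches
]

and, in the NDL macros, sweeping initValueScale over orders of magnitude:

W = Parameter(outDim, inDim, init="uniform", initValueScale=0.1)   # try e.g. 0.01, 0.1, 1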

cprschmid commented 8 years ago

@frankseide Thank you for your feedback. I am still learning how to use the different knobs (parameters) to control the learning behavior. I have already used some of your suggestions and was able to get the system to learn the housing data to some extent.

frankseide commented 8 years ago

Trust me, we are all still learning about these knobs! It still has far too many elements of a black art, but sadly that's the state of play. So please do not hesitate to ask further questions (and maybe share your experience as well, if you like).

dustinandrews commented 7 years ago

I had a lot of trouble with this topic too. I've created a notebook with notes that I think will help people trying to learn function approximation. You can see the pull request here: https://github.com/Microsoft/CNTK/pull/1767/files

Amalkhairy commented 7 years ago

I am doing exactly the same thing, but with nonlinear input data. I used a two-layer model: a hidden layer with a sigmoid activation function, and an output layer with a leaky ReLU. But I am struggling with the results: the output values are all the same number, which is definitely wrong. Any help, please?