Pixel-wise classification - training criterion is not a number (NAN)

Birky commented 7 years ago

Hey all,

I am new to CNTK and would like to try out a simple convolutional network for pixel-wise classification, but I am probably doing something wrong. Can you help me find out what it is?

I have 3D MR images of brains with tumours, which I cut into slices of size 240x240, and I also have the ground truth segmentation of the tumours. I converted this data to a txt file following the MNIST examples, with the input images as features and the ground truth images as labels. The input images are single-channel unsigned short images, but their values are less than 2500. The ground truth images are binary (0 - non-tumour, 1 - tumour).
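A minimal sketch of such a conversion, assuming each slice and its mask are available as NumPy arrays. The helper name `slice_to_ctf_line` and its `feat_scale` parameter are hypothetical and only for illustration. The stream names `features` and `labels` match the reader section of the config below.

```python
import numpy as np

def slice_to_ctf_line(image, mask, feat_scale=1.0):
    """Format one image slice and its mask as a single CNTK text format line.

    feat_scale is an illustrative option: passing 1/2500 would store the
    intensities in roughly [0, 1] instead of [0, 2500].
    """
    feats = (np.asarray(image, dtype=np.float32) * feat_scale).ravel()
    labs = np.asarray(mask, dtype=np.float32).ravel()
    feat_str = " ".join(f"{float(v):g}" for v in feats)
    lab_str = " ".join(f"{float(v):g}" for v in labs)
    return f"|features {feat_str} |labels {lab_str}"

# Tiny 2x2 stand-in for a 240x240 slice.
image = np.array([[0, 2500], [1250, 0]], dtype=np.uint16)
mask = np.array([[0, 1], [1, 0]], dtype=np.uint8)
line = slice_to_ctf_line(image, mask, feat_scale=1 / 2500)
print(line)
```

For a real 240x240 slice, each of the two streams would carry 57600 values per line, matching `dim = 57600` in the reader.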

My cntk config file is the following:

command = trainNetwork

precision = "float"; traceLevel = 1 ; deviceId = "auto"

rootDir = ".." ; dataDir = "$rootDir$/DataSets/BRATS" ;
outputDir = "./OutputBRATS" ;

modelPath = "$outputDir$/Models/02_OneConv2"
#stderr = "$outputDir$/02_OneConv_bs_out_brats"

# TRAINING CONFIG
trainNetwork = {
    action = "train"

    makeMode = false

    BrainScriptNetworkBuilder = {
        imageShape = 240:240:1                    
        truthShape = 240:240:1                        
        featScale = 1/2500

        Scale{f} = x => Constant(f) .* x

        model = Sequential (
            ConvolutionalLayer {16, (5:5), pad=true, activation=ReLU} :
            MaxPoolingLayer {(2:2), stride=(2:2)} :
            DenseLayer {64, activation=ReLU} :
            LinearLayer {truthShape}
        )

        # inputs
        features = Input {imageShape}
        labels = Input (truthShape)

        # apply model to features
        ol = model (features)

        # loss and error computation
        ce   = CrossEntropyWithSoftmax (labels, ol)
        errs = ClassificationError (labels, ol)

        # declare special nodes
        featureNodes    = (features)
        labelNodes      = (labels)
        criterionNodes  = (ce)
        evaluationNodes = (errs)
        outputNodes     = (ol)
    }

    SGD = {
        epochSize = 60000
        minibatchSize = 32
        maxEpochs = 15
        learningRatesPerSample = 0.001*5:0.0005
        momentumAsTimeConstant = 0
        numMBsToShowResult = 500
    }

    reader = {
        readerType = "CNTKTextFormatReader"
        file = "$DataDir$/train_flair_slices.txt"
        input = {
            features = { dim = 57600 ; format = "dense" }
            labels =   { dim = 57600 ; format = "dense" }
        }
    }   
}

However, when I start the training, it ends with an exception:

Starting minibatch loop.
Epoch[ 1 of 15]-Minibatch[ 1- 500, 26.67%]: ce = 1.#QNAN000 * 15820; errs = 45.386% * 15820; time = 80.4506s; samplesPerSecond = 196.6

[CALL STACK]

Microsoft::MSR::CNTK::TracingGPUMemoryAllocator:: operator=

Microsoft::MSR::CNTK::TracingGPUMemoryAllocator:: operator=

Microsoft::MSR::CNTK::TracingGPUMemoryAllocator:: operator=

Microsoft::MSR::CNTK::TracingGPUMemoryAllocator:: operator=

Microsoft::MSR::CNTK::TracingGPUMemoryAllocator:: operator=

Microsoft::MSR::CNTK::TracingGPUMemoryAllocator:: operator=

Microsoft::MSR::CNTK::TracingGPUMemoryAllocator:: operator=

Microsoft::MSR::CNTK::Matrix:: __autoclassinit2

BaseThreadInitThunk

RtlUserThreadStart

EXCEPTION occurred: The training criterion is not a number (NAN).

I tried out many things, but nothing really helps. I also tried replacing CrossEntropyWithSoftmax with CrossEntropy; then I didn't get the NAN exception, but ClassificationError reports 99-100% errors through all 15 epochs.

Am I doing the pixel-wise classification correctly by using the input images as the features and the ground truth images as the labels?
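One way to look at this question is the NumPy sketch below (not CNTK code). It assumes that a softmax cross-entropy applied to the whole output behaves like a single softmax over all pixels, which is how I read the config above. It compares that with an independent sigmoid per pixel. Both function names are made up for illustration.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """One softmax over ALL outputs, then cross-entropy against the labels."""
    z = logits - logits.max()                    # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return float(-(labels * log_probs).sum())

def per_pixel_binary_cross_entropy(logits, labels):
    """Independent sigmoid per pixel, averaged binary cross-entropy."""
    # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logit).
    loss = (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))
    return float(loss.mean())

# Flattened 2x2 mask with two tumour pixels and confidently correct logits.
labels = np.array([0.0, 1.0, 1.0, 0.0])
logits = np.array([-10.0, 10.0, 10.0, -10.0])

joint = softmax_cross_entropy(logits, labels)
independent = per_pixel_binary_cross_entropy(logits, labels)
print(joint, independent)
```

In this toy case the single softmax has to split its probability mass between the two tumour pixels, so its loss stays near 2 ln 2 even though every pixel is predicted correctly. The per-pixel loss goes close to zero. That suggests a mask with many tumour pixels is not a natural target for one softmax over the whole slice.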

I also tried to write out the outputs of the model (ol), but I don't have enough GPU memory (4GB).

# output the results
Output = {
    action = "write"
    reader = {
        readerType = "CNTKTextFormatReader"
        file = "$DataDir$/train_flair_slices.txt"
        input = {
            features = { dim = 57600 ; format = "dense" }
            labels =   { dim = 57600 ; format = "dense" }
        }
    }   
    outputPath = "LR.txt"  # dump the output to this text file
}

Kind regards, Birky

cha-zhang commented 7 years ago

Your network is too simple and it's probably not going to work well. One thing you may try is to reduce your learning rate.
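To illustrate why the learning rate matters here, the sketch below runs plain SGD on a one-weight least-squares toy problem. It is not the CNTK network. It assumes that unscaled intensities near 2500 reach the model, since the config defines `featScale` and `Scale` but the `model` shown does not apply them. The function `train_one_weight` is hypothetical.

```python
import math

def train_one_weight(x, y, lr, steps=100):
    """Plain SGD on the squared error 0.5 * (w*x - y)**2 for a single sample."""
    w = 0.0
    for _ in range(steps):
        err = w * x - y
        w -= lr * err * x      # each step multiplies the error by (1 - lr * x**2)
    return w

raw = train_one_weight(x=2500.0, y=1.0, lr=0.001)       # unscaled input
scaled = train_one_weight(x=1.0, y=1.0, lr=0.001)       # input scaled by 1/2500
small_lr = train_one_weight(x=2500.0, y=1.0, lr=1e-7)   # unscaled, tiny step

print(math.isnan(raw), scaled, small_lr)
```

With the unscaled input and a step of 0.001, the update factor is far outside (-1, 1), so the weight overflows and turns into NaN. Scaling the input or shrinking the step keeps the toy problem finite. The real network may fail for other reasons as well, so treat this only as a plausibility check.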

Also, you are not really doing classification, so CrossEntropyWithSoftmax may not be the right criterion. Try SquaredError and see if things are any different.
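As a rough NumPy illustration of that suggestion (not CNTK code), the output can be regressed onto the mask with a summed squared difference. A per-pixel error can then be measured by thresholding. The 0.5 threshold and both function names are assumptions for this sketch. Whether CNTK's SquaredError uses exactly this scaling was not checked.

```python
import numpy as np

def squared_error(pred, target):
    """Sum of squared differences between predicted and true mask values."""
    return float(((pred - target) ** 2).sum())

def pixel_error_rate(pred, target, threshold=0.5):
    """Fraction of pixels whose thresholded prediction disagrees with the mask."""
    return float(((pred >= threshold) != (target >= 0.5)).mean())

# Flattened 2x2 example: one tumour pixel is missed after thresholding.
target = np.array([0.0, 1.0, 1.0, 0.0])
pred = np.array([0.1, 0.8, 0.4, 0.2])

loss = squared_error(pred, target)
err = pixel_error_rate(pred, target)
print(loss, err)
```

A per-pixel error rate like this is easier to interpret for segmentation than a single argmax-style classification error over the whole slice.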

raaaar commented 7 years ago

Closing this thread as it hasn't been updated in a while.

microsoft/CNTK issue #1255