microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

Different results from run to run #361

Closed amirbegan closed 8 years ago

amirbegan commented 8 years ago

I am getting slightly different results from one training run to the next. I would expect the output to remain the same, because there should either be no randomization or, if there is some randomization, it should happen with the same seed.

This is the setup (the dataset is 1,000 records, so the minibatch size is set to the full dataset):

    Playground_Train = [
        action = "train"

        SimpleNetworkBuilder = [
            layerSizes = 50:50:1
            trainingCriterion = "squareError"
            evalCriterion = "ErrorPrediction"
            layerTypes = "RectifiedLinear"
        ]

        SGD = [
            epochSize = 0
            minibatchSize = 1000
            learningRatesPerMB = 20
            momentumPerMB = 0.9
            maxEpochs = 20000
        ]

        reader = [
            readerType = "UCIFastReader"
            file = "$DataDir$/$TrainSet$"
            miniBatchMode = "partial"
            randomize = "none"
            verbosity = 0

            features = [
                dim = 50
                start = 0
            ]

            labels = [
                labelType = "regression"
                dim = 1
                start = 50
            ]
        ]
    ]

On one run of the above I got:

    Finished Epoch[ 7 of 20000]: [Training Set] TrainLossPerSample = 6.3630738e-005;...

On another run I got:

    Finished Epoch[ 7 of 20000]: [Training Set] TrainLossPerSample = 6.3630723e-005;...

There are more of these small differences throughout (but not after every epoch).

Is this expected behavior? Is there something I could do to get identical results on different runs?

dongyu888 commented 8 years ago

Are you running on different devices (CPU vs. GPU)? If so, the small difference is expected, since random number generation on these devices is different and the math functions are also implemented differently (meaning floating-point operations may be carried out in different orders and thus produce slightly different results).

If you run the same setup with the same binary on the same device, from a clean start-up, using the same CUDA library (same version), the results should be the same.
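
To see the ordering effect in isolation, here is a tiny plain-Python illustration (not CNTK code): floating-point addition is not associative, so combining the same values in a different order can round differently.

    # Floating-point addition is not associative: the same three numbers
    # combined in two different orders give two different rounded results.
    print((0.1 + 1e20) - 1e20)   # 0.0  (the 0.1 is absorbed by 1e20 and lost)
    print(0.1 + (1e20 - 1e20))   # 0.1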

amirbegan commented 8 years ago

On the GPU I get differing results from run to run (as described), even though I use the same setup and the same binaries.

I've tried running on CPU, and on the CPU I get identical results across runs.

After 20,000 epochs the difference is:

    run 1: Finished Epoch[20000 of 20000]: [Training Set] TrainLossPerSample = 4.7848891e-012;...
    run 2: Finished Epoch[20000 of 20000]: [Training Set] TrainLossPerSample = 5.3680311e-012;...

dongyu888 commented 8 years ago

Thanks. This is weird; it was not the case in the past. We will check to see why this happens.

veikkoeeva commented 8 years ago

I'm just a curious lurker, but would someone be kind enough to post a link to where these algorithms are defined in the code?

Maybe this is completely off, but could it be due to floating-point calculations? This might be of interest to other lurkers here, so I'll link a few nice sources about this:

Specifically in GPU code I'd expect problems with summation etc. (cf. Kahan summation), but I suppose these are already taken care of by the library.
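
For the curious, here is a minimal plain-Python sketch of the compensated (Kahan) summation mentioned above; it only illustrates the technique and is not CNTK's actual reduction code.

    def kahan_sum(values):
        """Compensated summation: carry a running correction term so the
        low-order bits lost in each addition are fed back in later."""
        total = 0.0
        compensation = 0.0
        for x in values:
            y = x - compensation
            t = total + y                    # low-order bits of y may be lost here
            compensation = (t - total) - y   # algebraically zero; captures the lost bits
            total = t
        return total

    data = [0.1] * 10**6
    print(sum(data))        # naive: roughly 100000.00000133288 (rounding error accumulates)
    print(kahan_sum(data))  # compensated: roughly 100000.00000000001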

frankseide commented 8 years ago

Note that there might be residual non-determinism in GPU reduction operations, where multiple thread blocks compute partial sums that must then be aggregated across blocks. I believe the order of execution of those thread blocks is not entirely deterministic (that is why cuDNN implements several versions of backprop through convolution and explicitly declares some of them as non-deterministic). I have not actually seen it myself, but it may depend on the specific CUDA hardware architecture, and maybe even on driver interactions or OS-side non-determinism.

The non-determinism only affects the summation order of float values, which can lead to minor differences. The difference you showed is in that range.
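
As a toy model of that effect (plain Python/NumPy, just mimicking a block-wise reduction, not the actual CUDA kernels): each "thread block" produces a float32 partial sum, and the partials are then combined in whatever order the blocks happen to finish.

    import numpy as np

    rng = np.random.RandomState(0)
    data = rng.uniform(-1, 1, 1 << 18).astype(np.float32)

    blocks = data.reshape(-1, 256)                   # 1024 "thread blocks" of 256 values
    partials = blocks.sum(axis=1, dtype=np.float32)  # one float32 partial sum per block

    in_order = np.float32(0)
    for p in partials:                               # combine partials in block order
        in_order = np.float32(in_order + p)

    reordered = np.float32(0)
    for p in rng.permutation(partials):              # combine in a different completion order
        reordered = np.float32(reordered + p)

    print(in_order, reordered, in_order - reordered) # usually differs in the last few bits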

I am also not super worried about the seemingly large relative error at the end. You started at 1e-5 and are now at 1e-12, for a loss function consisting of a difference of two values, where one of them is determined by the neural network being trained. See e.g. https://en.wikipedia.org/wiki/Loss_of_significance.
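
To make the magnitudes concrete, a back-of-the-envelope illustration (the predictions below are made-up numbers, not taken from these runs): once the prediction matches the target to several digits, the squared-error loss is the square of a tiny difference, so a change in the last digits of the prediction moves the loss by a large relative amount.

    target = 1.0
    pred_run1 = 1.0 + 2.19e-6   # hypothetical predictions, identical to ~6 digits
    pred_run2 = 1.0 + 2.32e-6

    loss1 = (pred_run1 - target) ** 2   # ~4.8e-12
    loss2 = (pred_run2 - target) ** 2   # ~5.4e-12
    print(loss1, loss2)                 # large relative gap, negligible absolute gap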

You sometimes see similar differences when running Release vs. Debug, where in Release, floating-point operations are sometimes reordered.

amirbegan commented 8 years ago

Thanks for the answers.

frankseide commented 8 years ago

Just to close this off: the non-determinism in GPU reductions has been eliminated for all tensor operations as of last week, provided you run the same build configuration on the same GPU type.

(Non-determinism now only exists for convolution operations, which rely on NVIDIA code that we allow to select a non-deterministic summation order when it is faster.)