Make training with CuDNN enabled deterministic

nicjac commented 7 years ago

Dear all,

I spent most of my day trying to find a solution to this issue, but to no avail. I am hoping that some of you might be able to assist me.

I am trying to train networks in a deterministic way. This has multiple advantages, especially in situations where reproducibility is essential, such as commercial software or competitions (e.g. Kaggle). When using MatConvNet, I have noticed that there can be significant deviations between networks trained with exactly the same settings and identical data. This seems to be particularly true at inference time, possibly because of the batch normalization layers?

Anyway, the only way I have found to train networks reproducibly on a GPU is to disable CuDNN completely (by passing the "NoCudnn" option to all layers that accept it) while fixing the seed for both the CPU and GPU. However, this leads to a significant decrease in performance.

According to the CuDNN documentation, it is possible to force the use of deterministic algorithms (albeit at the cost of performance). Based on that, and on a Theano thread, I have attempted to patch "nnconv_cudnn.cu" with instructions to use the correct, deterministic algorithms. While I do notice a performance hit, the results are still not deterministic and vary every time I start the training again.
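
For context, one common reason GPU results differ between runs is that floating-point addition is not associative, so a reduction whose accumulation order changes from run to run (for example, one built on atomic adds) can return slightly different sums. Below is a minimal sketch of that effect in plain Python. It is not MatConvNet or CuDNN code, the helper name `accumulate` is made up for illustration, and whether this is the mechanism behind the behaviour described here is an assumption.

```python
def accumulate(values):
    """Sum floats strictly left to right, like a sequential accumulator."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three numbers, added in two different orders.
order_a = [1e16, 1.0, -1e16]   # the 1.0 is absorbed by 1e16 before cancellation
order_b = [1e16, -1e16, 1.0]   # the large terms cancel first, so the 1.0 survives

print(accumulate(order_a))  # 0.0
print(accumulate(order_b))  # 1.0

# A smaller-scale version of the same effect.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

In a network, differences of this kind start out tiny but can be amplified over many training iterations, which would be consistent with seeing different trained models despite identical seeds and data.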

Any help would be greatly appreciated. I am sure I am close, but I am probably missing something obvious!

Cheers, Nicolas

albanie commented 7 years ago

Hi @nicjac - sorry for the slow response. Will look into it. Did you make any headway in your investigations?

nicjac commented 7 years ago

@albanie sadly not, did not manage to find a solution to this issue! I have to disable CuDNN while training all my networks until it is resolved.

vlfeat / matconvnet

Make training with CuDNN enabled deterministic #778