ibcn-cloudlet / dianne

DIANNE - DIstributed Artificial Neural NEtworks
http://dianne.intec.ugent.be
GNU Affero General Public License v3.0

CUDNN not supporting all modules #7

Closed maxclaey closed 7 years ago

maxclaey commented 7 years ago

CUDNN does not appear to support some modules in DIANNE. When I try to deploy a network with a ConvLayer, errors occur, including:

Caused by: be.iminds.iot.dianne.api.nn.module.ModuleException: Error in forward of module be.iminds.iot.dianne.nn.module.regularization.BatchNormalization c871f226-772f-3b96-bd4e-6537ff345132: CUDNN_STATUS_NOT_SUPPORTED
    at be.iminds.iot.dianne.api.nn.module.AbstractModule.forward(AbstractModule.java:191)
    at be.iminds.iot.dianne.api.nn.module.AbstractModule.forward(AbstractModule.java:143)
    at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at be.iminds.aiolos.proxy.ServiceProxy.invoke(ServiceProxy.java:198)
    at com.sun.proxy.$Proxy18.forward(Unknown Source)
    ... 4 more
Caused by: java.lang.Exception: CUDNN_STATUS_NOT_SUPPORTED
    at be.iminds.iot.dianne.tensor.ModuleOps.batchnorm(Native Method)
    at be.iminds.iot.dianne.nn.module.regularization.BatchNormalization.forward(BatchNormalization.java:136)
    at be.iminds.iot.dianne.api.nn.module.AbstractModule.forward(AbstractModule.java:187)
    ... 10 more

Just after launching the learn job, the following output occurs:

CUDNN_STATUS_NOT_SUPPORTED CudnnModuleOps.c:340 Error during learning

Looking at the specified line, I think cudnnBatchNormalizationForwardInference is not supported by CUDNN. With a simple network consisting only of fully connected layers and some ReLUs, everything works fine using CUDNN.

tverbele commented 7 years ago

Which version of CUDA/CUDNN do you have installed?

maxclaey commented 7 years ago

We are using CUDA 8.0 and CUDNN 5.1. I had tried CUDNN 6.0 before, but ran into issues there because some function signatures had changed.

tverbele commented 7 years ago

Can you share the neural network? We have a JUnit test for BatchNorm so the build should fail on cudnn if there is a problem with the module...

maxclaey commented 7 years ago

Unfortunately, I don't recall the exact configuration of the net, as I was just playing around a bit, but I'm quite sure it was a rather minimal network with not much more than a ConvLayer. Things already went wrong at the deployment step.

tverbele commented 7 years ago

Probably there was an error configuring the BatchNorm module wrt the input dimensions it got, since all JUnit tests keep working... Reopen if the issue persists and if you can share a neural network description.
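
To illustrate the kind of misconfiguration meant here, below is a minimal sketch in plain Java of a sanity check between a BatchNorm size and the dimensions of the input it receives. Everything in it is an assumption for illustration only: `BatchNormShapeCheck`, `featureCount` and `check` are hypothetical names and not part of the DIANNE API, and the sketch assumes per-channel normalization with a `[batch, channels, height, width]` (or `[channels, height, width]`) layout after a ConvLayer.

```java
// Hypothetical helper, NOT part of DIANNE: compares a configured BatchNorm
// size against the feature dimension of the input it would receive.
class BatchNormShapeCheck {

    // Returns the size of the feature dimension under the assumed layouts:
    // [features], [batch, features], [channels, h, w] or [batch, channels, h, w].
    static int featureCount(int[] dims) {
        switch (dims.length) {
            case 1: return dims[0]; // [features]
            case 2: return dims[1]; // [batch, features]
            case 3: return dims[0]; // [channels, height, width]
            case 4: return dims[1]; // [batch, channels, height, width]
            default:
                throw new IllegalArgumentException("Unsupported rank: " + dims.length);
        }
    }

    // Throws if the configured size does not match the input's feature dimension.
    static void check(int configuredSize, int[] inputDims) {
        int actual = featureCount(inputDims);
        if (actual != configuredSize) {
            throw new IllegalArgumentException(
                "BatchNorm configured for " + configuredSize
                + " features, but input provides " + actual);
        }
    }

    public static void main(String[] args) {
        int[] convOutput = {8, 16, 28, 28}; // example ConvLayer output, 16 channels

        check(16, convOutput); // matches the channel count, passes silently

        try {
            // Configured with the flattened size instead of the channel count.
            check(16 * 28 * 28, convOutput);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A check like this fails early with a readable message, rather than surfacing later as a status code from the native backend. Whether this particular mismatch is what produced CUDNN_STATUS_NOT_SUPPORTED in this issue is not confirmed.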