maxclaey closed this issue 7 years ago.
Which version of CUDA/CUDNN do you have installed?
We are using CUDA 8.0 and cuDNN 5.1. I tried cuDNN 6.0 earlier, but ran into issues because some function signatures had changed.
Can you share the neural network? We have a JUnit test for BatchNorm, so the cuDNN build should fail if there were a problem with the module...
Unfortunately, I don't recall the exact configuration of the net, as I was just playing around a bit, but I'm quite sure it was a rather minimal network with little more than a ConvLayer. Things already went wrong at the deployment step.
Probably the BatchNorm module was misconfigured with respect to the input dimensions it received, since all JUnit tests still pass... Reopen if the issue persists and you can share a neural network description.
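One way to check whether a BatchNorm module's configuration matches its input is to compare its parameter count against the shape cuDNN expects for the scale/bias/mean/variance tensor. The sketch below is hypothetical and not DIANNE code: `BatchNormShapeCheck`, `paramShape`, `paramCount` and `matches` are illustrative names, and it assumes the input reaches cuDNN as a 4D NCHW tensor. It follows the rule documented for `cudnnDeriveBNTensorDescriptor`: 1xCx1x1 in spatial mode (intended for use after convolutional layers) and 1xCxHxW in per-activation mode (intended for use after non-convolutional layers).

```java
import java.util.Arrays;

// Hypothetical helper, not part of DIANNE or cuDNN: mirrors the shape rule
// documented for cudnnDeriveBNTensorDescriptor, assuming a 4D NCHW input.
public class BatchNormShapeCheck {

    enum Mode { SPATIAL, PER_ACTIVATION }

    /** Shape of the scale/bias/mean/variance tensor for input dims {N, C, H, W}. */
    static int[] paramShape(int[] nchw, Mode mode) {
        if (nchw.length != 4) {
            throw new IllegalArgumentException("expected a 4D NCHW shape, got " + nchw.length + "D");
        }
        return mode == Mode.SPATIAL
                ? new int[] {1, nchw[1], 1, 1}               // one value per channel
                : new int[] {1, nchw[1], nchw[2], nchw[3]};  // one value per activation
    }

    /** Number of values each of gamma, beta, mean and variance must hold. */
    static int paramCount(int[] nchw, Mode mode) {
        return Arrays.stream(paramShape(nchw, mode)).reduce(1, (a, b) -> a * b);
    }

    /** True if a module configured with {@code size} parameters fits this input and mode. */
    static boolean matches(int size, int[] nchw, Mode mode) {
        return size == paramCount(nchw, mode);
    }

    public static void main(String[] args) {
        int[] convOutput = {8, 16, 28, 28}; // batch of 8, 16 channels, 28x28 feature maps
        System.out.println(Arrays.toString(paramShape(convOutput, Mode.SPATIAL)));        // [1, 16, 1, 1]
        System.out.println(Arrays.toString(paramShape(convOutput, Mode.PER_ACTIVATION))); // [1, 16, 28, 28]
        System.out.println(matches(16, convOutput, Mode.SPATIAL));                         // true
        System.out.println(matches(16, convOutput, Mode.PER_ACTIVATION));                  // false
    }
}
```

A module sized for 16 parameters fits a ConvLayer output only in spatial mode; if the size matches neither mode, the module was likely configured for a different input than the one it actually receives.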
cuDNN does not appear to support some modules in DIANNE. When I try to deploy a network with a ConvLayer, errors occur, including:
Caused by: be.iminds.iot.dianne.api.nn.module.ModuleException: Error in forward of module be.iminds.iot.dianne.nn.module.regularization.BatchNormalization c871f226-772f-3b96-bd4e-6537ff345132: CUDNN_STATUS_NOT_SUPPORTED
    at be.iminds.iot.dianne.api.nn.module.AbstractModule.forward(AbstractModule.java:191)
    at be.iminds.iot.dianne.api.nn.module.AbstractModule.forward(AbstractModule.java:143)
    at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at be.iminds.aiolos.proxy.ServiceProxy.invoke(ServiceProxy.java:198)
    at com.sun.proxy.$Proxy18.forward(Unknown Source)
    ... 4 more
Caused by: java.lang.Exception: CUDNN_STATUS_NOT_SUPPORTED
    at be.iminds.iot.dianne.tensor.ModuleOps.batchnorm(Native Method)
    at be.iminds.iot.dianne.nn.module.regularization.BatchNormalization.forward(BatchNormalization.java:136)
    at be.iminds.iot.dianne.api.nn.module.AbstractModule.forward(AbstractModule.java:187)
    ... 10 more
Just after launching the learn job, the following output appears:
CUDNN_STATUS_NOT_SUPPORTED CudnnModuleOps.c:340
Error during learning
Looking at the indicated line, it seems that the call to cudnnBatchNormalizationForwardInference is what cuDNN rejects as not supported. With a simple network of only fully connected layers and some ReLUs, everything works fine using cuDNN.
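When cuDNN rejects a batch normalization configuration, a plain CPU version of inference-mode batch normalization can be useful to check what a small input should produce, or as a temporary stand-in. The sketch below is hypothetical and not DIANNE's implementation: `BatchNormInferenceReference` and `forwardInference` are illustrative names, and it assumes spatial mode (one gamma, beta, mean and variance per channel) on a flat NCHW float array, applying the standard formula y = gamma[c] * (x - mean[c]) / sqrt(variance[c] + eps) + beta[c].

```java
import java.util.Arrays;

// Hypothetical CPU reference for spatial batch normalization in inference mode;
// not DIANNE code. Input is a flat float array in NCHW order.
public class BatchNormInferenceReference {

    static float[] forwardInference(float[] x, int n, int c, int h, int w,
                                    float[] gamma, float[] beta,
                                    float[] mean, float[] variance, double eps) {
        if (x.length != n * c * h * w) {
            throw new IllegalArgumentException("input length does not match N*C*H*W");
        }
        if (gamma.length != c || beta.length != c || mean.length != c || variance.length != c) {
            throw new IllegalArgumentException("per-channel parameters must have length C");
        }
        float[] y = new float[x.length];
        int plane = h * w; // number of elements in one H x W feature map
        for (int i = 0; i < n; i++) {
            for (int ch = 0; ch < c; ch++) {
                // Precompute gamma / sqrt(variance + eps) once per channel
                double scale = gamma[ch] / Math.sqrt(variance[ch] + eps);
                int offset = (i * c + ch) * plane;
                for (int p = 0; p < plane; p++) {
                    y[offset + p] = (float) (scale * (x[offset + p] - mean[ch]) + beta[ch]);
                }
            }
        }
        return y;
    }

    public static void main(String[] args) {
        // 1 sample, 2 channels, 1x2 feature maps
        float[] x = {1f, 3f, 10f, 20f};
        float[] y = forwardInference(x, 1, 2, 1, 2,
                new float[] {1f, 2f}, new float[] {0f, 1f},
                new float[] {2f, 15f}, new float[] {1f, 25f}, 0.0);
        System.out.println(Arrays.toString(y));
    }
}
```

Precomputing the per-channel scale keeps the inner loop to a single multiply-add per element. For a per-activation layout the same loop would index gamma, beta, mean and variance by `ch * plane + p` instead of `ch`.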