jcjohnson / fast-neural-style

Feedforward style transfer

Error when trying to train new style model (cudnnFindConvolutionBackwardDataAlgorithm failed: 2) #93

Open kanilptl opened 7 years ago

kanilptl commented 7 years ago

Hi

I can run the webcam demo code, but I can't train a new style model. When I try, I get the following error (I suspect it might be because of my cudnn version, which is 7.5, but I'm not sure):

cudnnFindConvolutionBackwardDataAlgorithm failed: 2 convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA4,32,256,256 -filtA3,32,9,9 4,3,256,256 -padA4,4 -convStrideA1,1 CUDNN_DATA_FLOAT
/home/user/Documents/torch/install/bin/luajit: ...l/Documents/torch/install/share/lua/5.1/nn/Container.lua:67:
In 22 module of nn.Sequential:
...nil/Documents/torch/install/share/lua/5.1/cudnn/find.lua:483: cudnnFindConvolutionBackwardDataAlgorithm failed, sizes: convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA4,32,256,256 -filtA3,32,9,9 4,3,256,256 -padA4,4 -convStrideA1,1 CUDNN_DATA_FLOAT
stack traceback:
    [C]: in function 'error'
    ...nil/Documents/torch/install/share/lua/5.1/cudnn/find.lua:483: in function 'backwardDataAlgorithm'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:222: in function 'updateGradInput'
    ...anil/Documents/torch/install/share/lua/5.1/nn/Module.lua:31: in function <...anil/Documents/torch/install/share/lua/5.1/nn/Module.lua:29>
    [C]: in function 'xpcall'
    ...l/Documents/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    .../Documents/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
    train.lua:211: in function 'opfunc'
    ...nil/Documents/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
    train.lua:239: in function 'main'
    train.lua:327: in main chunk
    [C]: in function 'dofile'
    ...ents/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    ...l/Documents/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    .../Documents/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
    train.lua:211: in function 'opfunc'
    ...nil/Documents/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
    train.lua:239: in function 'main'
    train.lua:327: in main chunk
    [C]: in function 'dofile'
    ...ents/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50

xyddj commented 7 years ago

Hello, have you solved this issue? I had the same problem after re-installing torch.

kanilptl commented 7 years ago

It worked for me after I installed it on another machine with the correct versions. So I suspect it is a problem with mismatched versions of cuda and theano. Hope that helps.

Please let me know if you get a solution for it.

cddlyf commented 7 years ago

@silverbird43852 @xyddj have you solved this issue?

sebi-ursulescu commented 7 years ago

I'm having the exact same problem.

cudnnFindConvolutionBackwardDataAlgorithm failed: 2 convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA3,64,168,168 -filtA128,64,3,3 3,128,84,84 -padA1,1 -convStrideA2,2 CUDNN_DATA_FLOAT
/home/arturo/torch/install/bin/luajit: /home/arturo/torch/install/share/lua/5.1/nn/Container.lua:67:
In 8 module of nn.Sequential:
/home/arturo/torch/install/share/lua/5.1/cudnn/find.lua:483: cudnnFindConvolutionBackwardDataAlgorithm failed, sizes: convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA3,64,168,168 -filtA128,64,3,3 3,128,84,84 -padA1,1 -convStrideA2,2 CUDNN_DATA_FLOAT
stack traceback:
    [C]: in function 'error'
    /home/arturo/torch/install/share/lua/5.1/cudnn/find.lua:483: in function 'backwardDataAlgorithm'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:222: in function 'updateGradInput'
    /home/arturo/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/arturo/torch/install/share/lua/5.1/nn/Module.lua:29>
    [C]: in function 'xpcall'
    /home/arturo/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /home/arturo/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
    train.lua:211: in function 'opfunc'
    /home/arturo/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
    train.lua:239: in function 'main'
    train.lua:327: in main chunk
    [C]: in function 'dofile'
    ...turo/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:151: in main chunk
    [C]: at 0x00405d50

Did anyone find out what is happening?

cddlyf commented 7 years ago

@Sebstyless In my case, this error happens when the GPU runs out of memory, so I guess the stack traceback info may be misleading.

EnvnHash commented 7 years ago

I have the same issue, but I have enough memory, so I can't confirm this.

EnvnHash commented 7 years ago

...there are other threads that talk about this issue. You should have noted that it's about the GPU running out of memory.

EnvnHash commented 7 years ago

...the answer was already here: https://github.com/jcjohnson/fast-neural-style/issues/100. Setting -batch_size to 2 fixed this for me.
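
For anyone wondering why a smaller batch size would matter here, below is a rough Python sketch that reads the tensor shapes out of the hash string in the first error message and computes how large the layer's input is in float32. It assumes the three shape groups in the hash are the input, filter, and output sizes in that order, with the batch size as the first input dimension; that reading is my assumption, not something confirmed in this thread. The helper names (parse_shapes, tensor_bytes, with_batch) are made up for illustration. It counts a single tensor only and ignores the cuDNN workspace and every other layer, so it cannot settle whether the failure is really an out-of-memory condition.

```python
import re

BYTES_PER_FLOAT32 = 4  # the log reports CUDNN_DATA_FLOAT


def parse_shapes(hash_str):
    """Return (input, filter, output) shapes from the hash string.

    Assumption: the groups after -dimA and -filtA, plus the bare group
    that follows, are the input, filter, and output sizes respectively.
    """
    m = re.search(r"-dimA([\d,]+) -filtA([\d,]+) ([\d,]+)", hash_str)
    if m is None:
        raise ValueError("unrecognized hash format")
    return tuple(tuple(int(x) for x in g.split(",")) for g in m.groups())


def tensor_bytes(shape):
    """Size in bytes of a dense float32 tensor with the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n * BYTES_PER_FLOAT32


def with_batch(shape, batch):
    """Same shape with the leading (batch) dimension replaced."""
    return (batch,) + tuple(shape[1:])


# Hash copied from the first error message in this thread.
hash_str = ("-dimA4,32,256,256 -filtA3,32,9,9 4,3,256,256 "
            "-padA4,4 -convStrideA1,1 CUDNN_DATA_FLOAT")
inp, filt, out = parse_shapes(hash_str)

for batch in (4, 2):
    mib = tensor_bytes(with_batch(inp, batch)) / 2**20
    print(f"batch {batch}: input to this layer is {mib:.0f} MiB")
```

Under those assumptions, the input to this one layer is 32 MiB at batch size 4 and 16 MiB at batch size 2. Activation memory scales linearly with batch size across the whole network, which is at least consistent with -batch_size 2 helping on a GPU that is short on memory.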

ProGamerGov commented 6 years ago

@EnvnHash The issue may not be caused by a lack of memory, as I get this error well before my GPU memory is full: https://github.com/soumith/cudnn.torch/issues/384, though I get this error on Neural-Style.