jcjohnson / neural-style

Torch implementation of neural style algorithm
MIT License

Error in CuDNN: CUDNN_STATUS_ALLOC_FAILED #116

Closed MoritzLost closed 2 years ago

MoritzLost commented 8 years ago

So I just returned from the holidays and updated both the neural-style repo and its dependencies. Now I get some error messages that weren't there before ...

Before the holidays, I could process images with up to -image_size 900. Now, if I run the following command:

th neural_style.lua -gpu 0 -backend cudnn -print_iter 50 -save_iter 50 -num_iterations 1500
-content_image "content.jpg" -style_image "style.jpg" -content_weight 2e0 -style_weight 5e2
-image_size 900 -cudnn_autotune

with the new -cudnn_autotune flag, I get this error message (cutting off the beginning for readability):

Setting up style layer      2   :   relu1_1 
Setting up style layer      7   :   relu2_1 
Setting up style layer      12  :   relu3_1 
/home/gin/torch/install/bin/luajit: /home/gin/torch/install/share/lua/5.1/cudnn/init.lua:58: Error in CuDNN: CUDNN_STATUS_ALLOC_FAILED
stack traceback:
    [C]: in function 'error'
    /home/gin/torch/install/share/lua/5.1/cudnn/init.lua:58: in function 'errcheck'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:174: in function 'createIODescriptors'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:337: in function 'updateOutput'
    /home/gin/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    neural_style.lua:204: in function 'main'
    neural_style.lua:499: in main chunk
    [C]: in function 'dofile'
    .../gin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405d70

If I run the same command without the -cudnn_autotune flag, I get an out-of-memory error. I've tried reducing the output image size; at -image_size 700, it works as expected both with and without the -cudnn_autotune flag. Is there anything I can do to make this work (again) for larger output images? For example, does the size of the content image and style image matter? Or is there anything else that might have caused my machine to no longer be able to process images with -image_size 900?

I have a GTX 970, and I'm using the cuDNN backend with CUDA 7.5, with cutorch and cunn installed as well as cuDNN 3.0.

Thanks!

Edit: I've tried a command from my bash history that worked perfectly fine before the holidays and the updates. Now it stops after Setting up style layer 12 : relu3_1 with an out-of-memory error. So it's not just an issue of different style and content images and/or settings ...
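(As a rough sanity check on why 700 fits but 900 doesn't: the VGG-19 activations that neural-style keeps around scale with the square of -image_size. A minimal sketch below; the layer list is an approximation of VGG-19's conv stack, not the exact model, and real usage is considerably higher because of gradients and the style/content targets.)

```python
# Rough estimate of VGG-19 conv activation memory for a square input,
# assuming float32 activations (4 bytes each).
def vgg19_activation_mb(image_size):
    h = w = image_size
    channels = [64, 64, 128, 128, 256, 256, 256, 256,
                512, 512, 512, 512, 512, 512, 512, 512]
    pools_after = {1, 3, 7, 11, 15}  # max-pool halves H and W after these convs
    total_bytes = 0
    for i, c in enumerate(channels):
        total_bytes += c * h * w * 4
        if i in pools_after:
            h //= 2
            w //= 2
    return total_bytes / 1024 ** 2

print(vgg19_activation_mb(700))  # forward activations alone, in MB
print(vgg19_activation_mb(900))  # roughly (900/700)^2 ~ 1.65x more
```

So going from 700 to 900 needs about 65% more activation memory, which is easily the difference between fitting and not fitting on a 2 GB card.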

monik3r commented 8 years ago

Sadly, you don't have enough VRAM. Also, depending on the amount of detail in each image, the VRAM requirements can inflate. 900 pixels on a 970 is REALLY good in my experience.

MoritzLost commented 8 years ago

You are right, 2 GB of VRAM isn't much for this application. However, that doesn't explain why, after the updates, a command that worked before now fails, even though I'm using the exact same parameters and images as before ...

rouniuyizu commented 8 years ago

I've got the same problem, but it works if I remove the autotune flag. Similar environment: a W541 laptop with only 2 GB of VRAM, Ubuntu 15.10, CUDA 7.5 + cuDNN v4 (v5 doesn't work). Hope that helps.

QWERTYman2020 commented 8 years ago

So, I'm new to Ubuntu and Linux in general. I've been running some tests with this software for two weeks, and I've stumbled across a similar problem.

user@user-B85M-D3H:~/neural-style$ time th neural_style.lua -content_image /home/user/Documents/srcx.png -style_image /home/bart2/Documents/style6.jpg -image_size 620 -gpu 0 -num_iterations 2000 -optimizer adam -backend cudnn -cudnn_autotune -seed 666 -save_iter 0 -output_image out8.png
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:505] Reading dangerously large protocol message.  If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 574671192
Successfully loaded models/VGG_ILSVRC_19_layers.caffemodel
/home/user/torch/install/bin/luajit: /home/user/torch/install/share/lua/5.1/nn/Container.lua:67: 
In 35 module of nn.Sequential:
/home/user/torch/install/share/lua/5.1/cudnn/init.lua:58: Error in CuDNN: CUDNN_STATUS_ALLOC_FAILED (cudnnFindConvolutionForwardAlgorithm)
stack traceback:
    [C]: in function 'error'
    /home/user/torch/install/share/lua/5.1/cudnn/init.lua:58: in function 'errcheck'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:185: in function 'createIODescriptors'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:360: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:357>
    [C]: in function 'xpcall'
    /home/user/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /home/user/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    neural_style.lua:204: in function 'main'
    neural_style.lua:500: in main chunk
    [C]: in function 'dofile'
    ...art2/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

WARNING: If you see a stack trace below, it doesn't point to the place where this error occured. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    /home/user/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    /home/user/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    neural_style.lua:204: in function 'main'
    neural_style.lua:500: in main chunk
    [C]: in function 'dofile'
    ...art2/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

I temporarily fixed this by running:

export LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64:/home/bart2/torch/install/lib:
source ~/.bashrc

Then Torch would work until I launched Firefox. I don't know if this was the correct solution, but it worked for me.
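(One note on the workaround above: an `export` only changes the current shell, and `source ~/.bashrc` re-runs the startup file without saving anything, so the fix is lost when the terminal closes. To make it persistent, the export line itself has to be appended to `~/.bashrc`. The paths below are the ones from the comment; adjust to your own install.)

```shell
# Persist the library path so every new interactive shell picks it up.
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64:$HOME/torch/install/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
```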

manupillai308 commented 5 years ago

Just ran into a similar error. My kernel was dying before this error popped up in the terminal. I solved it by closing all the other Jupyter notebooks (one was already open), and then it worked. Try closing any other notebooks that are using the GPU (simply, any in which you have imported TensorFlow).

chrober24 commented 4 years ago

I am receiving a similar error training a pix2pix model. Hardware: a 1080 Ti in slot 1 and a 2080 Ti in slot 2 on an MSI Tomahawk AC X299 motherboard (both PCIe x16 slots).

I only get the allocation error when trying to use the 2080 Ti in the second PCIe slot on my motherboard. I need it there for thermal reasons (the 1080 Ti overheats with the 2080 Ti above it). I have tried with just the 2080 Ti installed (this succeeded), as well as using CUDA_VISIBLE_DEVICES to select only the 2080 Ti when both were installed (this caused the error). Is there some hardware limitation with allocating memory on the second PCIe device?
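(One thing worth checking, as an assumption about your setup rather than a confirmed fix: by default the CUDA runtime orders devices fastest-first, not by PCIe slot, so the 2080 Ti may already be device 0 and `CUDA_VISIBLE_DEVICES=1` would then select the 1080 Ti. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` makes the indices follow the physical slots, and whichever GPU you expose is renumbered to device 0 inside the process. A sketch; the training invocation is illustrative only.)

```shell
# Make device indices follow PCIe bus order, then expose only the card
# in the second slot; inside the process it appears as device 0, e.g.:
#   CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 python train.py

# Both are ordinary environment variables handed to the child process:
CUDA_VISIBLE_DEVICES=1 sh -c 'echo "child sees: $CUDA_VISIBLE_DEVICES"'
# prints: child sees: 1
```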