jcjohnson / fast-neural-style

Feedforward style transfer

Error in CuDNN: CUDNN_STATUS_ALLOC_FAILED in training #17

Open rayset opened 8 years ago

rayset commented 8 years ago

Epoch 1.000000, Iteration 20695 / 40000, loss = 323682.617689 0.001
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/.luarocks/share/lua/5.1/nn/Container.lua:67:
In 5 module of nn.Sequential:
/home/ubuntu/torch/install/share/lua/5.1/cudnn/init.lua:58: Error in CuDNN: CUDNN_STATUS_ALLOC_FAILED
stack traceback:
    [C]: in function 'error'
    /home/ubuntu/torch/install/share/lua/5.1/cudnn/init.lua:58: in function 'errcheck'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:186: in function 'createIODescriptors'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:364: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:361>
    [C]: in function 'xpcall'
    /home/ubuntu/.luarocks/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /home/ubuntu/.luarocks/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    train.lua:164: in function 'opfunc'
    /home/ubuntu/torch/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
    train.lua:240: in function 'main'
    train.lua:328: in main chunk
    [C]: in function 'dofile'
    ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670
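(For context: CUDNN_STATUS_ALLOC_FAILED is cuDNN reporting that a GPU memory allocation failed. In this trace it is raised from cudnn.SpatialConvolution's createIODescriptors, which rebuilds descriptors and convolution workspace whenever the input shape changes, so it tends to surface exactly when the pipeline switches to a new batch shape.)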

This happens when it moves on to the validation set (40k images); as far as I know the 80k training images work fine... maybe? I'm not sure whether it goes over them first. What could this be? The results look quite bad too; I guess it didn't finish its iterations.

[image: out]

My h5 file is 22+ GB. It did fail with a strange error (something like "cannot allocate/open") when it was missing 4 (four) images from the validation set.

edit: I redid it with a new h5 file that completed correctly (identical size to the bit, though)... still the same error. It has some problem starting the second epoch for some reason; I read the code but did not find any clear cause.

tzatter commented 8 years ago

Hello, rayset. Maybe you need more memory. I have succeeded in training with a GTX 960 with 2 GB of memory. Try adding the parameter "-batch_size 1".
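(A rough back-of-the-envelope illustration of why batch size dominates here; the layer sizes are assumptions based on the transform network in Johnson et al., not measurements: a batch of 4 RGB images at 512x512 in float32 is 4 x 3 x 512 x 512 x 4 bytes, about 12.6 MB, at the input alone. Every convolution layer then holds activations of comparable or larger size (32-128 channels), the VGG-16 loss network adds its own, and backpropagation roughly doubles the total, so a few GB of VRAM disappears quickly. Activation memory scales linearly with batch size, which is why lowering -batch_size is the cheapest fix.)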

rayset commented 8 years ago

Those runs are on EC2 instances with 4 GB of VRAM. Is your h5 file ~22 GB too?


tzatter commented 8 years ago

24,239,214,976 bytes. I downloaded the images from here.

tzatter commented 8 years ago

The quality came out bad! I trained with another image.

training flags

$ th train.lua \
  -h5_file /path/to/fast_neural_style.h5 \
  -style_image /path/to/style.jpg \
  -style_image_size 384 \
  -content_weights 1.0 \
  -style_weights 5.0 \
  -checkpoint_name checkpoint \
  -gpu 0 \
  -loss_network /path/to/vgg16.t7 \
  -batch_size 1

... Epoch 0.483191, Iteration 40000 / 40000, loss = 78592.863735 0.001
Running on validation set ...
val loss = 122388.069840

this is the style image [image: sunrise-182302_1280] https://cloud.githubusercontent.com/assets/17694190/19222418/dd37d1c4-8e92-11e6-93e6-777cd62aa40c.jpg

this is the content image [image: 6131] https://cloud.githubusercontent.com/assets/17694190/19222426/0b534c6e-8e93-11e6-952d-5224de3daa5b.jpg

this is the result image [image: 6131] https://cloud.githubusercontent.com/assets/17694190/19222428/1e3ea1d4-8e93-11e6-8c65-3a7d3512f6a5.jpg

rayset commented 8 years ago

With that batch size you should increase the number of iterations.
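For example (a sketch only, reusing the placeholder paths from the flags above; it assumes train.lua's defaults of -batch_size 4 and -num_iterations 40000, so quartering the batch means roughly quadrupling the iterations to show the network the same number of images):

$ th train.lua \
  -h5_file /path/to/fast_neural_style.h5 \
  -style_image /path/to/style.jpg \
  -loss_network /path/to/vgg16.t7 \
  -batch_size 1 \
  -num_iterations 160000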


tzatter commented 8 years ago

Thank you! I should use the new GPU instances on Amazon EC2 (p2.xlarge): https://aws.amazon.com/jp/blogs/aws/new-p2-instance-type-for-amazon-ec2-up-to-16-gpus/

rayset commented 8 years ago

A g2 (GRID K520, 4 GB of VRAM) has plenty of memory for a batch of 4 512 px images.


tzatter commented 8 years ago

That's right! I'm sorry, I can't wait any longer. It takes a long time.

tzatter commented 8 years ago

Finally I understand what you were saying. I hit the same issue as you.

$ python scripts/make_style_dataset.py \
  --train_dir ~/mount01/train2014 \
  --val_dir ~/mount01/val2014 \
  --output_file ~/mount01/fast_neural_style_mscoco_512px.h5 \
  --height 512 \
  --width 512

...

Copied 40400 / 40504 images
Copied 40500 / 40504 images
Exception in thread Thread-5 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
  File "/usr/lib/python2.7/threading.py", line 754, in run
  File "scripts/make_style_dataset.py", line 64, in read_worker
  File "/usr/lib/python2.7/Queue.py", line 138, in put
  File "/usr/lib/python2.7/threading.py", line 384, in notify
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
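(That last TypeError looks like the well-known Python 2 shutdown artifact: module globals are torn down while a worker thread is still running, and the log shows the copy loop had already reached 40500 / 40504, so the HDF5 file may well be complete. One way to check, assuming the HDF5 command-line tools (h5ls) are installed, is to list every dataset and its shape:)

$ h5ls -r ~/mount01/fast_neural_style_mscoco_512px.h5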

Did you solve the problem?

rayset commented 8 years ago

Not at all, sadly :(
