jcjohnson / neural-style

Torch implementation of neural style algorithm
MIT License

Memory issues and previews? #45

Open dxmachina opened 9 years ago

dxmachina commented 9 years ago

Before anything else - this is a fantastic project!!

I've read through some previous advice re: memory usage. I'm using a 980 Ti (6GB VRAM) but am still battling image size limits. Using ADAM and cuDNN I am able to get up to a 1200px image just within my limit, but as you noted, the default ADAM settings produce a noticeably inferior output. I've yet to find settings that look as good (any suggestions?).

This brings up two different issues: (1) how can we work towards larger output, and (2) how can we get a bit more visual feedback for the myriad of settings?

In terms of memory, is there anything else that could be done short of buying a Titan X? Correct me if I'm wrong, but multi-GPU usage is also not supported (so I couldn't just add a card, right?). Any other ideas? Any possibility of a combined GPU/CPU mode or something along those lines?

As to the issue of visual feedback, I find myself very often running "quick" proofs using just a few iterations at a very small size (256). With so many options to consider, I feel like I'm stabbing in the dark a bit. Anyway, my suggestion (or perhaps something I should just figure out how to do myself) is a preview mode that renders, say, 8 small thumbnails using a variety of options, which could then be looked through to find optimal settings. What do you think?

Thanks again for sharing this wonderful project!

jcjohnson commented 9 years ago

Another way to produce larger outputs would be to swap the VGG-19 network for something smaller, like AlexNet or CaffeNet which are both available in the Caffe model zoo and are already supported in neural-style. I did some preliminary experiments with CaffeNet, and it indeed ran much faster and used much less memory, but my results were a lot worse than VGG-19 so I've left that as the default.

CPU + GPU is probably not going to help; the CPU will be much slower than the GPU, and frequent communication between the CPU and GPU will slow everything to a crawl.

You are correct that multi-GPU is not currently supported, and it would be tricky to add. For deep learning models there are two basic ways to handle multi-GPU: data parallelism, where you split your minibatches across GPUs for each step, or model parallelism, where you compute different parts of the same model on different GPUs. Data parallelism is easier to implement in general, but doesn't apply to neural-style since we only have a single datapoint that we pass through the model. Model parallelism would be possible, and could be implemented using fbcunn, but I've never used it so I don't know how easy it is to use. I'm also hesitant to pull in fbcunn as a dependency, because it would transitively pull in a lot of other dependencies and make the whole project a lot harder to install.

For visual feedback, I'm not sure that your idea will work. The same parameters (content weight, etc.) tend to give different results at different image sizes, so you really need to test parameters at the size you want to render. You can change the -save_iter flag to save intermediate results more frequently; that might help with experimentation. At larger image sizes you really get bitten: the correct parameters differ from those for smaller image sizes, each iteration takes longer, and you need a good number of iterations (at least 100) to get a sense of what it's going to look like.
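That said, if you do want to experiment at the final size, something like this (file names hypothetical) writes out an intermediate image every 20 iterations, so you can kill the run early if the parameters are clearly off:

th neural_style.lua \
  -content_image my_content.jpg \
  -style_image my_style.jpg \
  -image_size 1200 \
  -num_iterations 1000 \
  -save_iter 20 \
  -optimizer adam \
  -backend cudnn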

dxmachina commented 9 years ago

Thanks for your prompt response. I guess there's no free lunch today. :)

How large an image are you able to get out of the Titan card?

jcjohnson commented 9 years ago

With ADAM and cuDNN I can generate 1387x1850 on the Titan X, but due to all the same problems you are encountering I haven't gotten anything that actually looks good at that resolution.

Currently I'm also driving my monitor off the Titan X, and compiz / X server are taking up about 900MB of memory; if I were to run the display off a different GPU and devote the entire Titan X to neural-style I could generate an image that was a bit bigger.

dxmachina commented 9 years ago

Thanks for that.

Have you noticed (when using ADAM) that there is a bit of a repeating odd/even pattern when viewing iterations... one seems to add small details, the next takes them away? It's interesting to see the process happening. I find that I like the smoothness of the default better - but using ADAM seems to sometimes offer a superior overall luminance.

jcjohnson commented 9 years ago

That may have to do with the TV regularization; you can try adjusting the -tv_weight parameter: larger values give smoother results and smaller values give less smooth results.

Currently the TV regularization is L2; on my TODO list is to add L1 TV regularization, which might give nicer results.
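In the meantime, a quick sweep around the default of 1e-3 (file names hypothetical) makes the effect of the weight easy to compare side by side:

for tv in 1e-2 1e-3 1e-4; do
  th neural_style.lua \
    -content_image my_content.jpg \
    -style_image my_style.jpg \
    -tv_weight $tv \
    -output_image out_tv_${tv}.png
done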

jcjohnson commented 9 years ago

I was playing with HD images tonight, and made this one at 1920x1010:

[image: starry_stanford_hd]

It took around an hour to run on the Titan X.

Here are the parameters I used if you want to use them as a starting point for large images:

th neural_style.lua \
  -content_image examples/inputs/hoovertowernight.jpg \
  -style_image examples/inputs/starry_night.jpg \
  -num_iterations 2000 \
  -image_size 1920 \
  -content_weight 5e3 \
  -style_weight 5e4 \
  -tv_weight 0 \
  -style_scale 0.3 \
  -optimizer adam \
  -backend cudnn \
  -normalize_gradients \
  -learning_rate 2e1

I'm not totally happy with this result: I disabled TV regularization, so there is some ugly high-frequency noise; I think a slightly larger style scale might also improve the results. But this is still the best I have done so far at high resolution.

dxmachina commented 9 years ago

Very interesting. Thanks for that. Do you find using ADAM with the VGG-19 superior to using LBFGS with a smaller model?

jcjohnson commented 9 years ago

I didn't spend too much time playing with smaller models, but even at low image sizes I was not happy with the results of L-BFGS with CaffeNet.

dxmachina commented 9 years ago

Can I ask you about how to get the alexnet/caffenet or VGG normalized models working?

I'm able to get the normalized model running, but the output is a solid color.

I also pulled down the other models, but am not able to run them without error messages.

jcjohnson commented 9 years ago

If you get a solid color, your TV regularization is probably too high. Try setting -tv_weight to a smaller value.

If you want to use a model other than VGG-19 or the normalized network, you'll need to use the -proto_file flag to specify the .prototxt file and the -model_file flag to specify the .caffemodel file. The names of the layers will differ from network to network, so you'll also have to use the -content_layers and -style_layers flags to specify which layers from the network you want to use for content and style reconstruction respectively. You can find the names of the layers by looking at the .prototxt file; for example with CaffeNet you may want something like -content_layers relu1,relu2,relu3,relu4 and -style_layers relu4. I'm not sure which layers will work best for other networks, so that is something you should experiment with.
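Putting that together, a CaffeNet run might look something like this (the proto/model paths are just wherever you put the files downloaded from the model zoo):

th neural_style.lua \
  -content_image examples/inputs/hoovertowernight.jpg \
  -style_image examples/inputs/starry_night.jpg \
  -proto_file models/caffenet/deploy.prototxt \
  -model_file models/caffenet/bvlc_reference_caffenet.caffemodel \
  -content_layers relu1,relu2,relu3,relu4 \
  -style_layers relu4 \
  -backend cudnn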

hughperkins commented 9 years ago

@dxmachina Hmmm, seems like an interesting challenge. I reckon it's always possible to find some way of training on arbitrarily-sized images, but it might be incredibly slow. It sounds like your priority right now is being able to generate large images, even if that's slower, rather than being super fast but unable to generate really large images. How much slowdown would you tolerate if there was a way of getting larger images? I know zilch about this specific project, but I assume all the memory is going into the im2col unwrapping in the convolutions (Justin, is this a fair assumption?), so convolution kernels that don't unwrap would be able to handle larger images, but they might be significantly slower, e.g. 20-100 times.

jcjohnson commented 9 years ago

@hughperkins There are quite a few things that eat up memory in this project :)

I haven't profiled it, but this is my guess as to what uses memory:

(1) im2col unwrapping, as you mentioned; however this only applies when using nn.SpatialConvolutionMM for convolutions. cuDNN does not use im2col for convolutions, which is why the cudnn backend saves a ton of memory. Of course, you can't use cuDNN on CPU or with OpenCL.

(2) Gradient history for L-BFGS. The L-BFGS algorithm builds up an approximation to the Hessian by storing a history of recent points and gradients; this uses quite a bit of auxiliary memory. Using ADAM rather than L-BFGS for optimization eliminates this, but adds more hyperparameters that need to be tweaked.

(3) VGG-19 activations and gradients. This is the big one. The convolutional part of VGG-19 that we use consists entirely of 3x3 convolutions with stride 1, 2x2 pooling with stride 2, and ReLUs. The architecture for the layers we use is c64-c64-pool-c128-c128-pool-c256-c256-c256-c256-pool-c512-c512-c512-c512-pool-c512. By my calculations, just storing the activations and gradients for all these layers for a 1920x1080 input takes about 5GB of memory. Getting around this is pretty tough.
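As a rough check on that number (4-byte floats, counting activations only and then doubling for gradients): the two c64 layers at 1920x1080 take about 2 x 64 x 1920 x 1080 x 4 bytes ≈ 1.06GB, the two c128 layers at 960x540 about 0.53GB, the four c256 layers at 480x270 about 0.53GB, the four c512 layers at 240x135 about 0.27GB, and the final c512 after the fourth pooling is negligible. That's roughly 2.4GB of activations, so around 5GB once gradients are included.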

To be honest I hadn't actually computed how much memory the VGG activations and gradients were taking before just now, and 5GB seems small. When actually running a 1920x1080 image, I see memory usage over 10GB. I wonder where my other 5GB are going? I'll have to investigate.

hughperkins commented 9 years ago

cuDNN does not use im2col for convolutions,

Interesting. That's new information for me. Will ponder this :-)

By my calculations, just storing the activations and gradients for all these layers for a 1920x1080 input takes about 5GB of memory.

Hmmm, right, good point. Each layer uses about 1920 * 1080 * 4 * 64 / 1024 / 1024 = 500MB; and there are 13 layers.

I see memory usage over 10GB. I wonder where my other 5GB are going?

I guess 5GB forwards, for the output tensors, and 5GB backwards, for the gradInput tensors?

You could plausibly reduce the memory footprint in the backwards pass by removing the gradInput tensors as you move down the layers, at the expense of needing to reallocate them each time. I think the reallocation expense might not be terribly high, so it could be worth trying.

(Edit: by the way, if you try this, make sure to call collectgarbage() after removing any gradInput tensors, to actually reclaim the memory.)

ryanpamplin commented 9 years ago

Will fbcunn be a lot faster and enable parallel processing? I am very eager for high quality, high resolution images. Thanks again so much!

jcjohnson commented 9 years ago

The convolution implementation in fbcunn probably won't be faster than the one from cuDNN, since VGG-19 uses 3x3 kernels and the FFT-based convolution used by fbcunn only has a speed advantage for large kernels.

However it is possible that the nn.ModelParallel module from fbcunn could allow for higher resolution outputs by distributing the work across multiple GPUs.

dxmachina commented 9 years ago

@hughperkins Yes, dealing with larger image sizes is definitely one goal of mine. But overall just finding a way to start to use these technologies in existing photographic and/or artistic workflows. The "black-box" nature of deep learning can make this difficult - but I am enjoying the tweak-ability of this implementation.

The major areas of interest to me are:

  1. Producing image sizes 2K and hopefully higher someday
  2. Exerting more deterministic artistic control over the models
  3. Exploring animation

It does sound like a large challenge given the enormous memory usage, but I have to believe there's some feasible solution. The trade-off may be waiting time, which is why I'm curious about getting some kind of telling preview before committing to a long process.

Love reading everyone's thoughts and trying to catch up a bit on the science.

dxmachina commented 9 years ago

Maybe a dumb idea, but given the speed of PCIe lanes and SSDs/memory, would it not be possible to swap data in and out periodically?

jcjohnson commented 9 years ago

@dxmachina It would be possible to swap data out into system memory / disk, but that will be REALLY slow. The problem isn't memory bandwidth so much as latency - copying data from CPU to GPU requires synchronization between host and device, which is slow. As implemented now, we copy data to the device once at the beginning, and during optimization we don't need to copy anything between host and device. If you wanted to save GPU memory by swapping activations and gradients into main memory, you'd end up with multiple host / device synchronizations per optimization step, which would slow everything down quite a bit.

Not to mention that implementing such a strategy would probably be pretty hairy - the built in torch containers don't support that sort of thing, so we'd have to implement it ourselves.

jcjohnson commented 9 years ago

@hughperkins If you're interested, there's a paper that discusses cuDNN's implementation of convolutions here: http://arxiv.org/abs/1410.0759

dxmachina commented 9 years ago

Any thoughts regarding this: http://petewarden.com/2015/05/23/why-are-eight-bits-enough-for-deep-neural-networks/

jjurg commented 9 years ago

Hi Justin (@jcjohnson), I was thinking of the following possible simple fix; let me know if you think it might work.

What if we created a new option, -init_image, that can be used to specify the starting image (adding to -init, the method for generating the initial image, currently one of random or image)? That would give us a way to start from a specific image and continue from it (we lose the gradients, but gain a way to seed the starting point).

Then we could write a script that chops the initial input into, let's say, 4 tiles (using ImageMagick). We run neural-style on each of the 4 parts on the GPU for 2000 iterations (using -style_scale 0.5). Then we glue the 4 pieces back together and run maybe 200 iterations on the CPU at full size to correct the edges.
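Roughly, for the chop/stitch part, something like this (file names hypothetical, ImageMagick for the tiling; the final full-size pass would need the proposed -init_image option):

convert input.jpg -crop 2x2@ +repage tile_%d.jpg

for i in 0 1 2 3; do
  th neural_style.lua \
    -content_image tile_${i}.jpg \
    -style_image style.jpg \
    -style_scale 0.5 \
    -num_iterations 2000 \
    -output_image styled_${i}.png
done

montage styled_0.png styled_1.png styled_2.png styled_3.png \
  -tile 2x2 -geometry +0+0 stitched.png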

What do you think?

schwittlick commented 8 years ago

@jjurg I've been going along that path, but haven't tried to apply more iterations to the stitched image. Did you?

hughperkins commented 8 years ago

Just out of curiosity, why does ADAM produce worse results than L-BFGS? They're both hunting for a local minimum, right?