jcjohnson / neural-style

Torch implementation of neural style algorithm
MIT License

memory & GPU usage question #85

Open gruffdavies opened 8 years ago

gruffdavies commented 8 years ago

Firstly, thanks for an awesome implementation!

I've tried lots of setup variants (CPU/GPU) and settled on AWS, but after bumping up against image size limitations I installed cudnn, thinking that would let me make bigger images. I'm still struggling to get beyond 512-600 or so, though. I'm on a g2.2xlarge (see table below), and I also tried a g2.8xlarge thinking I'd get both a speed and size boost, but it was exactly the same (although I haven't tried that one with cudnn yet).

| Model | GPUs | vCPU | Mem (GiB) |
|------------|------|------|-----------|
| g2.2xlarge | 1 | 8 | 15 |
| g2.8xlarge | 4 | 32 | 60 |

What image size limits should I expect with these two? And is it really the case that they don't differ in performance, or am I doing something wrong?

Thanks!

gruffdavies commented 8 years ago

Actually, scratch that first question - I do seem to be able to get up to 700px now. I'm not sure what I did wrong before, but that does seem bigger. The question about the g2.8xlarge remains though: can I take advantage of that architecture/extra memory? Thanks!

3DTOPO commented 8 years ago

Where do you see they claim it offers additional memory? I am no expert on the subject by any means but their page has a quote that says: "With the benefit of the new g2.8xlarge instances, we can now leverage data parallelism across multiple GPUs..". That leads me to believe it would just run on more chips - as if you had dual GPUs in your computer versus one.

gruffdavies commented 8 years ago

I read it here: https://aws.amazon.com/blogs/aws/new-g2-instance-type-with-4x-more-gpu-power/

"The 15GB of memory provided by the g2.2xlarge was a limiting factor in OpenEye’s ability to use AWS for FastROCS. The only piece of our cloud offering not yet running in AWS is an on-premises dedicated FastROCS machine. Now that the g2.8xlarge instance provides nearly four times more memory, FastROCS can be run on production-sized pharmaceutically-relevant datasets in AWS."

gruffdavies commented 8 years ago

You're correct though (thanks!) - I misread what that meant - they both have 4GB of video memory.

wrichter commented 8 years ago

I get to about 900 px on AWS using cudnn

sheerun commented 8 years ago

For me it fails somewhere between 700px and 900px with cudnn on any aws instance.

jcjohnson commented 8 years ago

Are you using Adam? That can also reduce memory usage and let you generate bigger images.
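
For context, Adam is selected with neural-style's `-optimizer adam` flag. A rough Lua sketch of why it is lighter on memory, loosely following the structure of `neural_style.lua` (the variable names here are illustrative, not copied from the source):

```lua
local optim = require 'optim'

-- feval(x) returns the style+content loss and its gradient for image x;
-- img is the image tensor being optimized.
if params.optimizer == 'lbfgs' then
  -- L-BFGS keeps a history of previous steps and gradients, so it allocates
  -- several extra image-sized buffers: better quality, more memory.
  local state = {maxIter = params.num_iterations, verbose = true}
  optim.lbfgs(feval, img, state)
elseif params.optimizer == 'adam' then
  -- Adam only tracks running first/second moment estimates (two buffers),
  -- which is why it can fit larger images in the same GPU memory.
  local state = {learningRate = params.learning_rate}
  for t = 1, params.num_iterations do
    optim.adam(feval, img, state)
  end
end
```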

jcjohnson commented 8 years ago

Also neural-style does not currently support acceleration with multiple GPUs, or optimization using both CPU and GPU. It might be possible to support multiple GPUs using ModelParallel from fbcunn but I haven't tried it.

sheerun commented 8 years ago

Adam gives quite a lot worse results than lbfgs, so it's out of the question for me :) Quality > Size

sheerun commented 8 years ago

Also, when using -style_scale 2 I can only make images ~1.5x smaller (that's a bummer, because an increased style scale often gives better results).

The current algorithm's memory usage is really unstable during the "Setting up style layer" steps. After optimization starts it's fairly stable, but I still saw memory steadily increasing on AWS.

josephfinlayson commented 8 years ago

"It might be possible to support multiple GPUs using ModelParallel from fbcunn"

Has anybody tried this? Does ModelParallel allow you to take advantage of the extra memory (not just the clock speed)?

gruffdavies commented 8 years ago

I tried ADAM but couldn't get good results from it either.

gruffdavies commented 8 years ago

Are the memory constraints with CPU less restrictive (i.e. is it available RAM as opposed to video memory)?

josephfinlayson commented 8 years ago

With CPU it's available RAM

ghost commented 8 years ago

I ran some experiments and it seems the GPU is only up to 5x faster than the CPU on a g2.2xlarge, so you might as well use the CPU if you need a one-off large image!

3DTOPO commented 8 years ago

Apparently the 16-layer VGG caffe model is slightly smaller (and it tests nearly as well), and I suspect it must be loaded onto the GPU. This thread is attempting to get it to work: https://github.com/jcjohnson/neural-style/issues/73

Perhaps even more compact models could also produce pleasing results. I wonder what would be involved in supporting models from the Model Zoo: https://github.com/BVLC/caffe/wiki/Model-Zoo
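
For reference, the Caffe models are loaded into Torch through loadcaffe, so trying a smaller network is mostly a matter of pointing at different files; a minimal sketch (the VGG-16 file names below are placeholders for whatever you downloaded):

```lua
require 'loadcaffe'

-- Illustrative file names; substitute the prototxt/caffemodel you actually have.
local proto = 'models/VGG_ILSVRC_16_layers_deploy.prototxt'
local model = 'models/VGG_ILSVRC_16_layers.caffemodel'

-- 'nn' backend here; 'cudnn' would also work if installed.
local cnn = loadcaffe.load(proto, model, 'nn')
print(cnn)  -- inspect layer names to pick suitable content/style layers
```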

jcjohnson commented 8 years ago

I've done some experiments with both VGG-16 and CaffeNet; I wasn't able to get good results from CaffeNet, but VGG-16 gives results that are very similar to VGG-19.

3DTOPO commented 8 years ago

I will take a stab at implementing fbcunn. It looks like the spatial convolution operations would need to be ported to fbcunn, correct? This is a whole new language to me (beyond syntax), so any tips, pointers and/or words of encouragement would be greatly appreciated!

jcjohnson commented 8 years ago

Rather than porting the convolutions, I think you would put the entire model inside a fbcunn.ModelParallel: http://facebook.github.io/fbcunn/fbcunn/#fbcunn.fbcunn.ModelParallel.dok

I'm not sure whether this would take the place of the nn.Sequential container that currently holds the model, or whether it could wrap the existing container.

One thing that worries me is that fbcunn pulls in a ton of dependencies (https://github.com/facebook/fbcunn/blob/master/INSTALL.md) that should not be required for neural-style.

3DTOPO commented 8 years ago

Thanks for the pointers. Regarding your concern about dependencies: fbcunn could be made optional, correct? Or are you suggesting a fork?

Or what about TensorFlow instead of fbcunn? Apparently it supports a single model across multiple GPUs: https://www.tensorflow.org/versions/master/tutorials/deep_cnn/index.html

Although it looks like TensorFlow too has a number of dependencies...

The thought of pooling GPU VRAM is tantalizing to me indeed, even with some overhead loss.

jcjohnson commented 8 years ago

Yes, fbcunn should be optional much like cunn and cudnn are currently optional: they are only imported when they are requested via flags.
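
For anyone curious, the existing pattern looks roughly like this (a simplified sketch of the cunn/cudnn handling, not the exact neural_style.lua code; the fbcunn branch is hypothetical):

```lua
-- CUDA packages are only required when the user asks for them via flags.
if params.gpu >= 0 then
  require 'cutorch'
  require 'cunn'
  cutorch.setDevice(params.gpu + 1)
  if params.backend == 'cudnn' then
    require 'cudnn'   -- loaded only when -backend cudnn is requested
  end
  -- A hypothetical flag (e.g. a multi-GPU option) could guard
  -- `require 'fbcunn'` in exactly the same way, keeping it optional.
else
  require 'nn'        -- CPU path: no CUDA packages needed
end
```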

I think that TensorFlow has a lot of cool features and is something to keep an eye on, but right now it is ~3x slower than Torch (https://github.com/soumith/convnet-benchmarks) which would probably dominate any speedups you got from scaling across multiple GPUs.

Also overall I'm not sure how much of a performance boost we could expect from multiple GPUs. Since we are only using a minibatch size of 1, we can't get speedups from data parallelism. The model is fully sequential, so we can't really run different parts of the model concurrently on different GPUs. fbcunn claims to be able to split convolutions with many kernels across GPUs; this would certainly allow us to run bigger images by utilizing the memory of all GPUs in the system, but I'm not sure that it will give significant speedups since it would introduce a lot of cross-GPU synchronization.
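
To make the kernel-splitting idea concrete, here is a very rough, untested Lua sketch, assuming fbcunn registers an `nn.ModelParallel` container; the exact API and concatenation dimension are assumptions taken from its docs, not something verified here:

```lua
require 'nn'
require 'cutorch'
require 'fbcunn'  -- assumed to provide nn.ModelParallel; not verified

-- Replace one convolution with many output kernels by a container whose two
-- children each compute half of the output channels on a different GPU; the
-- outputs are then concatenated back along the feature-map dimension.
local nIn, nOut = 256, 512
local split = nn.ModelParallel(2)  -- assumed: dim 2 = feature maps for NCHW input
split:add(nn.SpatialConvolution(nIn, nOut / 2, 3, 3, 1, 1, 1, 1))  -- GPU 1
split:add(nn.SpatialConvolution(nIn, nOut / 2, 3, 3, 1, 1, 1, 1))  -- GPU 2
-- `split` would stand in for the original conv layer inside the sequential
-- model, with the pretrained weights copied over in halves.
```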

3DTOPO commented 8 years ago

Good to know about TensorFlow, thanks!

While I understand that there would be overhead, I personally would like the option to render at higher resolution at the cost of overall performance.

Of course I would only use the option when the memory needed exceeded a single GPU, so it seems like it would be a win/win situation to me.

3DTOPO commented 8 years ago

Talk about GPU envy! Imagine spanning across 16 Maxwell chips!

https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design/

shushi2000 commented 7 years ago

I am having a very similar problem when using TensorFlow on a p2.xlarge instance. I built a CNN for the Kaggle facial competition - images are 96 by 96 - and the model runs well on the instance. When I use a similar model on images of 480 by 720, it exhausts the GPU memory... does this make sense to you? Or could it be that the code has bugs?