gruffdavies opened this issue 8 years ago
Actually, scratch that first question - I do seem to be able to get up to 700px now, I'm not sure what I did wrong but that does seem bigger. The question about the g2.8xlarge remains though. Can I take advantage of this architecture/extra memory? Thanks!
Where do you see they claim it offers additional memory? I am no expert on the subject by any means but their page has a quote that says: "With the benefit of the new g2.8xlarge instances, we can now leverage data parallelism across multiple GPUs..". That leads me to believe it would just run on more chips - as if you had dual GPUs in your computer versus one.
I read it here: https://aws.amazon.com/blogs/aws/new-g2-instance-type-with-4x-more-gpu-power/
"The 15GB of memory provided by the g2.2xlarge was a limiting factor in OpenEye’s ability to use AWS for FastROCS. The only piece of our cloud offering not yet running in AWS is an on-premises dedicated FastROCS machine. Now that the g2.8xlarge instance provides nearly four times more memory, FastROCS can be run on production-sized pharmaceutically-relevant datasets in AWS."
You're correct though (thanks!) - I misread what that meant - they both have 4GB of video memory.
I get to about 900 px on AWS using cudnn
For me it fails somewhere between 700px and 900px with cudnn on any aws instance.
Are you using Adam? That can also reduce memory usage and let you generate bigger images.
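To make the memory difference concrete, here's a rough back-of-envelope sketch in Python. The history size (100 correction pairs) and 4-byte floats are assumptions for illustration, not necessarily neural-style's actual defaults; the point is that L-BFGS keeps a history of vector pairs the size of the image being optimized, while Adam only keeps two moment vectors:

```python
# Rough comparison of optimizer state memory for the image tensor being
# optimized. L-BFGS stores a history of correction pairs (2 vectors per
# pair); Adam stores just 2 moment vectors regardless of history length.

def optimizer_state_bytes(h, w, channels=3, bytes_per_float=4,
                          lbfgs_history=100):
    n = h * w * channels * bytes_per_float   # size of the image tensor
    lbfgs = 2 * lbfgs_history * n            # s/y correction-pair vectors
    adam = 2 * n                             # first and second moments
    return lbfgs, adam

lbfgs, adam = optimizer_state_bytes(900, 900)
print(f"L-BFGS state: {lbfgs / 2**20:.0f} MiB")
print(f"Adam state:   {adam / 2**20:.1f} MiB")
```

With a 100-pair history the L-BFGS state is 100x the Adam state, which is why switching optimizers can free up room for bigger images.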
Also, neural-style does not currently support acceleration with multiple GPUs, or optimization using both CPU and GPU. It might be possible to support multiple GPUs using ModelParallel from fbcunn, but I haven't tried it.
Adam gives quite a lot worse results than L-BFGS, so it's out of the question for me :) Quality > Size
Also, when using -style_scale 2 I can only make images ~1.5x smaller (that's a bummer, because an increased style scale often gives better results).
The current algorithm is really memory-unstable during the "Setting up style layer" steps. Once it starts iterating, memory usage is quite stable, but I did see memory steadily increasing on AWS.
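The setup spike makes sense if you consider the style-layer feature maps that get captured at that point. Here's a rough sketch of how they grow with image size; the channel counts and pooling factors below follow VGG-19's usual style layers (relu1_1 through relu5_1), and the exact numbers in neural-style may differ, so treat this as an illustration only:

```python
# Rough estimate of the feature maps captured at each style layer during
# setup, for a square input image. Channel counts and downsample factors
# are VGG-19's typical style layers; 4-byte floats assumed.

STYLE_LAYERS = [  # (name, channels, downsample factor)
    ("relu1_1", 64, 1),
    ("relu2_1", 128, 2),
    ("relu3_1", 256, 4),
    ("relu4_1", 512, 8),
    ("relu5_1", 512, 16),
]

def style_feature_bytes(h, w, bytes_per_float=4):
    total = 0
    for name, c, stride in STYLE_LAYERS:
        total += c * (h // stride) * (w // stride) * bytes_per_float
    return total

for size in (512, 700, 900):
    mib = style_feature_bytes(size, size) / 2**20
    print(f"{size}x{size}: ~{mib:.0f} MiB of style-layer activations")
```

Since every term scales with h*w, going from 512px to 900px roughly triples this memory, on top of everything else the network holds.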
"It might be possible to support multiple GPUs using ModelParallel from fbcunn"
Has anybody tried this? Does ModelParallel allow you to take advantage of the extra memory (not just the clock speed)?
I tried ADAM but couldn't get good results from it either.
Are the memory constraints with CPU less restrictive (i.e. is it available RAM as opposed to video memory)?
With CPU it's available RAM
I ran some experiments and it seems the GPU is only up to 5x faster than the CPU on g2.2xlarge. So you might as well use the CPU if you need a one-off large image!
Apparently the 16-layer VGG Caffe model is slightly smaller (and it tests nearly as well), and since I suspect the whole model must be loaded onto the GPU, the smaller one should help. This thread is attempting to get it to work: https://github.com/jcjohnson/neural-style/issues/73
Perhaps even more compact models could also produce pleasing results. I wonder what's involved in using models from the Model Zoo: https://github.com/BVLC/caffe/wiki/Model-Zoo
I've done some experiments with both VGG-16 and CaffeNet; I wasn't able to get good results from CaffeNet, but VGG-16 gives results that are very similar to VGG-19.
I will take a stab at implementing fbcunn support. It looks like the spatial convolution operations would need to be ported to fbcunn, correct? This is a whole new language to me (beyond the syntax), so any tips, pointers, and/or words of encouragement would be greatly appreciated!
Rather than porting the convolutions, I think you would put the entire model inside a fbcunn.ModelParallel: http://facebook.github.io/fbcunn/fbcunn/#fbcunn.fbcunn.ModelParallel.dok
I'm not sure whether this would take the place of the nn.Sequential container that currently holds the model, or whether it could wrap the existing container.
One thing that worries me is that fbcunn pulls in a ton of dependencies (https://github.com/facebook/fbcunn/blob/master/INSTALL.md) that should not be required for neural-style.
Thanks for the pointers. Regarding your concern about dependencies: fbcunn could be made optional, correct? Or are you suggesting a fork?
Or what about TensorFlow instead of fbcunn? Apparently it supports a single model across multiple GPUs: https://www.tensorflow.org/versions/master/tutorials/deep_cnn/index.html
Although it looks like TensorFlow too has a number of dependencies...
The thought of pooling GPU VRAM together is tantalizing to me indeed, even with some overhead loss.
Yes, fbcunn should be optional, much like cunn and cudnn are currently optional: they are only imported when they are requested via flags.
I think that TensorFlow has a lot of cool features and is something to keep an eye on, but right now it is ~3x slower than Torch (https://github.com/soumith/convnet-benchmarks) which would probably dominate any speedups you got from scaling across multiple GPUs.
Also, overall I'm not sure how much of a performance boost we could expect from multiple GPUs. Since we are only using a minibatch size of 1, we can't get speedups from data parallelism. The model is fully sequential, so we can't really run different parts of the model concurrently on different GPUs. fbcunn claims to be able to split convolutions with many kernels across GPUs; this would certainly allow us to run bigger images by utilizing the memory of all GPUs in the system, but I'm not sure that it will give significant speedups, since it would introduce a lot of cross-GPU synchronization.
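To illustrate what that kind of kernel splitting means, here's a toy sketch in plain Python (not fbcunn, and not the real ModelParallel API): each "device" holds half of a layer's filters and computes its half of the output channels, and the halves are gathered at the layer boundary. That gather is the synchronization point, and it's also why per-device filter/output memory is halved:

```python
# Toy model-parallel convolution: split the kernels across two "devices",
# compute each half of the output channels independently, then concatenate.
# The concatenation is the cross-device sync point.

def conv1d(signal, kernels):
    """Valid 1-D convolution of `signal` with each kernel in `kernels`."""
    out = []
    for k in kernels:
        row = [sum(signal[i + j] * k[j] for j in range(len(k)))
               for i in range(len(signal) - len(k) + 1)]
        out.append(row)
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]

# Single device: all four kernels together.
full = conv1d(signal, kernels)

# Two "devices": each holds half of the kernels.
half_a = conv1d(signal, kernels[:2])   # device 0's output channels
half_b = conv1d(signal, kernels[2:])   # device 1's output channels
combined = half_a + half_b             # sync point: gather the halves

assert combined == full
print("split-and-concat matches the single-device result")
```

The math comes out identical either way; what you trade is per-device memory against the cost of that gather after every split layer.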
Good to know about TensorFlow, thanks!
While I understand that there would be overhead, I personally would like the option to render at higher resolution at the cost of overall performance.
Of course I would only use the option when the memory needed exceeded a single GPU, so it seems like it would be a win/win situation to me.
Talk about GPU envy! Imagine spanning across 16 Maxwell chips!
https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design/
I am having a very similar problem using TensorFlow on a p2.xlarge instance. I built a CNN for the Kaggle facial keypoints competition (images are 96 by 96) and the model runs well on the instance. When I use a similar model on 480 by 720 images, it exhausts the GPU memory. Does this make sense to you, or could it be that my code has bugs?
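It does make sense: for a convolutional stack, activation memory scales roughly with the number of input pixels, so the jump from 96x96 to 480x720 is much bigger than it looks. A quick sanity check (pure arithmetic, no framework specifics assumed):

```python
# If per-layer activation memory scales with the input pixel count (true
# for a fully convolutional stack), compare the two input sizes directly.

def pixel_ratio(h1, w1, h2, w2):
    return (h2 * w2) / (h1 * w1)

ratio = pixel_ratio(96, 96, 480, 720)
print(f"activation memory grows by ~{ratio:.1f}x")  # 345600 / 9216 = 37.5
```

A ~37.5x increase in activation memory can easily exhaust a single GPU even when the 96x96 model fit comfortably, so this is expected behavior rather than necessarily a bug.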
Firstly, thanks for an awesome implementation!
I've tried lots of setup variants (CPU/GPU) and settled on AWS, but having bumped up against image size limitations, I installed cuDNN thinking that would allow me to make bigger images; I'm still struggling to get beyond 512-600 or so. I'm on a g2.2xlarge (see table below), but I also tried g2.8xlarge thinking I'd get both a speed and size boost, and it was exactly the same (although I haven't tried that with cuDNN yet).
What image size limits should I expect with these two instance types, and is it really the case that they don't differ in performance, or am I doing something wrong?
Thanks!