jcjohnson / neural-style

Torch implementation of neural style algorithm
MIT License

100% CPU usage of 4GB (despite algorithm running on Tesla K80) #409

Open ajhool opened 7 years ago

ajhool commented 7 years ago

Specs (confirmed on AWS and Google Compute Engine): 4 vCPUs with 4 GB RAM each, a Tesla K80, and a 50 GB SSD. With my parameters, steady-state GPU utilization is only about 20%.

Problem: I have confirmed on multiple platforms (and with other users) that this algorithm pins a single CPU core at 100%. I have 4 cores on my system (16 GB RAM total), but only one of the cores is utilized.

I'm not sure how this affects style transfer performance, but I believe it limits the number of jobs that can run simultaneously -- people are busy watching the GPU and don't realize that the CPU is struggling.

To find the CPU utilization I used the Linux program "top"; the "luajit" process is the culprit.

Questions

  1. Is it possible to parallelize the CPU work using Torch? (I'm not familiar with the language.)
  2. What is the CPU being used for?

ajhool commented 7 years ago

Tried: export OMP_NUM_THREADS=4 followed by source ~/.bashrc

Result: No apparent change
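Also tried checking from inside Torch itself: torch.getnumthreads() / torch.setnumthreads() report and set the OpenMP thread count. A quick sketch, assuming a standard Torch7 install:

```lua
require 'torch'

-- How many OpenMP threads Torch thinks it may use right now.
print('threads before: ' .. torch.getnumthreads())

-- Ask Torch to use 4 threads for its CPU-side (OpenMP/BLAS) work.
torch.setnumthreads(4)
print('threads after:  ' .. torch.getnumthreads())
```

If the second print still shows 1, the Torch build itself (rather than the environment variable) may be the limiting factor.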

htoyryla commented 7 years ago

Here (Ubuntu 14.04 on an i7 with 8 threads) Torch utilizes all 8 threads nicely; running neural-style with -gpu -1 uses CPU up to 790%.

Google for "torch multithreading". If multithreading does not work, it might be that Torch has been compiled without OpenMP support (this is known to happen at least on OSX with an older version of gcc).

BTW I don't understand "4x 4GB vCPU" "16GB total". My cores all share the same RAM and I fail to see how it could be otherwise.

ajhool commented 7 years ago

"BTW I don't understand '4x 4GB vCPU' '16GB total'. My cores all share the same RAM and I fail to see how it could be otherwise."

I was just pointing out that RAM scales with the number of cores you choose when configuring cloud instances. The problem is that CPU usage seems to be limited to one of the 4 cores: that core sits at 100% utilization while the others are left unused. Memory usage is only at 22%.

I suspect that using -gpu -1 causes a change to the prediction portion of the algorithm that enables CPU multithreading. I'm experiencing this problem while running the prediction on the GPU (i.e. -gpu 0), so it's a separate portion of the code.

I've been looking into OpenMP support and OpenBLAS to see if the issue could be there.

htoyryla commented 7 years ago

"I suspect that using -gpu -1 causes a change to the prediction portion of the algorithm that enables CPU multithreading. I'm experiencing this problem while running the prediction on the GPU (ie. -gpu 0), so it's a separate portion of the code."

I am quite familiar with the neural_style code, except for the multigpu part, and the effect of the -gpu option is to set the Torch tensor types (FloatTensor or CudaTensor) and devices accordingly. What happens after that is up to Torch to decide.
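In outline (a paraphrase, not the exact source), the single-GPU setup boils down to something like this:

```lua
-- Paraphrased sketch of the single-GPU setup (not the exact
-- neural_style.lua source): the -gpu flag only selects the tensor
-- type and the CUDA device; everything after that is up to Torch
-- and the cutorch/cunn/cudnn packages.
local function setup_backend(gpu)
  if gpu >= 0 then
    require 'cutorch'
    require 'cunn'
    cutorch.setDevice(gpu + 1)       -- Torch devices are 1-indexed
    return 'torch.CudaTensor'        -- tensors and modules live on the GPU
  else
    return 'torch.FloatTensor'       -- pure CPU path, i.e. -gpu -1
  end
end

-- e.g. -gpu 0: later everything is cast with :type(dtype), so the
-- same code runs on CPU or GPU depending on this one choice.
local dtype = setup_backend(0)
```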

Nevertheless, I too see that when using -gpu 0, CPU utilization stays around 100% (with a peak of 110 to 125% at the start). On the other hand, GPU utilization stays close to 100%. I even tried the same with the pre-12/2017 version of neural_style (which does not, for instance, have multigpu support), with similar results (except that GPU memory usage is significantly lower with the old version... something to investigate).

So in my case I see no problem: a single CPU core is enough, and no more is needed or allocated. I don't know how Torch manages the number of CPU threads, though.

Even if neural_style does not explicitly control parallelization, I can imagine that it is possible to create a bottleneck between CPU and GPU through the use of Float and Cuda tensors. Even so, much of the actual processing happens outside neural_style, in the torch packages nn, cutorch, cunn, cudnn and optim. And with my setup, I am not experiencing any such bottleneck.

htoyryla commented 7 years ago

I found a way to create a "bottleneck" by setting -save_iter 1. Then CPU utilization was constantly around 117% and GPU utilization alternated between 34% and 80+%. This, too, makes sense, as the algorithm is constantly interrupted by the need to take the image, deprocess it, and save it.

Using -print_iter 1 was not enough to cause a bottleneck. GPU util was almost 100% all the time, CPU util around 100%.
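For reference, the per-iteration CPU-side checks look roughly like this (a paraphrased sketch from memory, not the exact neural_style.lua source; the flag names match the script, the bodies are simplified):

```lua
require 'torch'
require 'image'

-- Paraphrased sketch (not the exact neural_style.lua source) of the
-- per-iteration CPU work: two cheap modulo checks, and only on a
-- "save" iteration the expensive part of copying the image back to
-- the CPU and writing it to disk.
local params = { print_iter = 50, save_iter = 100, num_iterations = 1000 }

local function maybe_print(t, loss)
  if params.print_iter > 0 and t % params.print_iter == 0 then
    print(string.format('Iteration %d / %d, loss %f',
                        t, params.num_iterations, loss))
  end
end

local function maybe_save(t, img)
  if params.save_iter > 0 and
     (t % params.save_iter == 0 or t == params.num_iterations) then
    local disp = img:float()   -- GPU -> CPU copy when img is a CudaTensor
    image.save(string.format('out_%d.png', t), disp)
  end
end

-- e.g. at the end of each optimization step:
-- maybe_print(t, loss); maybe_save(t, img)
```

With -save_iter 1 the expensive branch runs every single iteration, which matches the drop in GPU utilization above.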

All my tests have been done on dedicated computers with an i7 CPU (8 threads) and either a GTX1070 or a GTX960 GPU. Perhaps what you are experiencing is something specific to a virtual environment.

dovanchan commented 7 years ago

@htoyryla Hi, have you ever tried another framework for neural style (such as TensorFlow)? I think Torch is not well suited for a production environment. I am trying to develop an 'ostagram'-like service with an iOS mobile app version, but I don't know which framework would be good for us and could support more people stylizing images at the same time on an AWS p2.xlarge K80 GPU... I am not familiar with deep learning frameworks; can you give me some suggestions?

htoyryla commented 7 years ago

@dovanchan I am more oriented towards experimenting, so I cannot really comment on what would work for a production system. For my purposes I have found TensorFlow a bit cumbersome; Torch and more recently PyTorch work better for me. But that is from an experimenter's point of view (I need to write and modify code quickly).

I don't know if the framework matters too much; the challenges of memory and GPU capacity will be there whatever you use.

dovanchan commented 7 years ago

@htoyryla Thanks for your answer. Can you recommend anyone you know who might be able to solve my problem, so that I can ask them for help? If they can help me, I will pay for it.

ajhool commented 7 years ago

@htoyryla

"Using -print_iter 1 was not enough to cause a bottleneck. GPU util was almost 100% all the time, CPU util around 100%."

Would you mind clarifying that? Isn't 100% usage of both processors a bottleneck, or is a bottleneck usually considered to be a program crash?

I agree that it could be a product of the environment -- so far, I've only been able to confirm this issue on cloud platforms (AWS, GCE) and do not have hands-on access to a comparable GPU.

So far, I have still not resolved this issue.

@dovanchan there are multiple implementations of neural-style in Tensorflow available on github. In my experience, those implementations consume more GPU resources than this implementation, given similar parameters.

htoyryla commented 7 years ago

"Would you mind clarifying that? Isn't 100% usage of both processors a bottleneck, or is a bottleneck usually considered to be a program crash?"

With "bottleneck" I was referring to what I wrote earlier: "Even if neural_style does not explicitly control parallelization, I can imagine that it is possible to create a bottleneck between CPU and GPU through the use of Float and Cuda tensors."

100% GPU means that the GPU is running at full capacity, while 100% CPU on an 8-thread CPU is only 12.5% of the total CPU capacity. As the GPU is running at full capacity and the CPU has plenty of capacity left, there is no bottleneck causing the GPU to sit idle (whereas in your case there appears to be a bottleneck somewhere preventing the GPU from being utilized fully).

So, to identify what could cause the GPU to run idle, I was looking at what the CPU is doing during the iterations. Usually not much: just checking at each iteration what needs to be done next. Then, when I used -save_iter 1, the bottleneck probably occurred in writing the image to disk at every iteration. -print_iter 1, which only means printing the losses to the console, did not significantly reduce GPU utilization (this could be different when using an ssh session; I haven't tried it).

PS. Of course you can call 100% GPU a bottleneck too, but I understood that that was not your problem, but rather that the GPU was not running at full capacity because of a bottleneck somewhere (which you thought was due to a lack of CPU parallelization). And it was that bottleneck I was interested in. But maybe I misunderstood your problem.

ajhool commented 7 years ago

Ah, yes that makes total sense, thanks. I think you understand my problem perfectly. I originally thought that the CPU was only/predominantly being utilized for saving iterations, so that's why I was so surprised to see 100% usage. I'm not experienced with torch/lua so I was hoping to get by with just enough knowledge -- but, as is usually the case, I'll need to dedicate more time to learning the fundamentals.

htoyryla commented 7 years ago

"I originally thought that the CPU was only/predominantly being utilized for saving iterations"

Looking at the neural_style code, that is essentially what the CPU does once it gets the iterations going. So if you set -print_iter 1000 -save_iter 1000, there is very little the CPU does between iterations: check whether to print or save, decide not to do either, and then start a new iteration. It is not clear to me either why even a single thread is running at 100%. That is probably due to how Torch works and how it handles the operations between CPU and GPU, and I have not so far delved that deep.
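For anyone curious, the loop has roughly this shape (a heavily simplified paraphrase; a dummy objective stands in for the real feval, which runs the network forward/backward on the GPU when the tensors are CudaTensors):

```lua
require 'torch'
require 'optim'

-- Heavily simplified paraphrase of the optimization loop: per
-- iteration the CPU just calls into optim (a dummy objective stands
-- in for the real feval) and then does the print/save bookkeeping.
local num_iterations = 1000
local optim_state = { learningRate = 10 }   -- illustrative -learning_rate value

local x = torch.randn(100)                  -- stand-in for the image being optimized
local function feval(x_)
  local loss = x_:dot(x_)                   -- placeholder for the style/content losses
  local grad = x_ * 2                       -- placeholder for the backward pass
  return loss, grad
end

for t = 1, num_iterations do
  local _, losses = optim.adam(feval, x, optim_state)
  -- maybe_print(t, losses[1]); maybe_save(t, x)   -- the only other CPU work per iteration
end
```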