jcjohnson / neural-style

Torch implementation of neural style algorithm
MIT License

Degraded performance after updating from Cuda 8.0 to Cuda 9.0, and cuDNN v5 to cuDNN v7? #429

Open ProGamerGov opened 7 years ago

ProGamerGov commented 7 years ago

After using th neural_style.lua -gpu 0 -backend cudnn to compare an older install of Neural-Style to a newer one, I noticed that performance seems to have gotten worse. Is this something that's fixable, or are the new versions of Cuda and cuDNN just not as efficient as the previous versions?


cuda-repo-ubuntu1604_8.0.44-1_amd64.deb, cudnn-8.0-linux-x64-v5.0-ga.tgz

Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-38-generic x86_64)

ubuntu@ip-Address:~$ nvidia-smi
Wed Oct 25 00:23:28 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   59C    P0   137W / 149W |   1365MiB / 11439MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1536    C   /home/ubuntu/torch/install/bin/luajit         1363MiB |
+-----------------------------------------------------------------------------+

Compared to:


Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-1038-aws x86_64)

cudnn-9.0-linux-x64-v7.tgz, libcudnn7_7.0.3.11-1+cuda9.0_amd64.deb

ubuntu@ip-Address:~$ nvidia-smi
Wed Oct 25 00:24:57 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   55C    P0   144W / 149W |   1755MiB / 11439MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1653      C   /home/ubuntu/torch/install/bin/luajit       1744MiB |
+-----------------------------------------------------------------------------+

More memory is also used before the network has even loaded the model:

|    0      2065      C   /home/ubuntu/torch/install/bin/luajit        200MiB |

This is a pretty large change in terms of resource usage, and it's definitely not for the better.
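For a more precise comparison than eyeballing nvidia-smi, memory can also be read from inside Torch. A minimal sketch, assuming a working cutorch install (the test allocation is just to show the counter moving):

```lua
-- Minimal sketch, assuming the cutorch package is installed.
-- cutorch.getMemoryUsage returns free and total device memory in bytes.
require 'cutorch'

local function report(label)
  local free, total = cutorch.getMemoryUsage(cutorch.getDevice())
  print(string.format('%s: %.0f MiB used of %.0f MiB',
      label, (total - free) / 2^20, total / 2^20))
end

report('before allocation')
local x = torch.CudaTensor(1024, 1024, 64)  -- ~256 MiB of float32, just as a test
report('after allocation')
```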

ProGamerGov commented 7 years ago

Testing Cuda 8.0 (cuda-repo-ubuntu1604_8.0.61-1_amd64.deb, CUDA Toolkit 8.0 GA2 (Feb 2017)) with th neural_style.lua -gpu 0 -backend cudnn, and cudnn-8.0-linux-x64-v5.0-ga.tgz, seems to use more memory as well, which is interesting compared to the CUDA Toolkit 8.0 GA1 (Sept 2016) version tested above.

ubuntu@ip-Address:~/neural-style$ nvidia-smi
Wed Oct 25 20:50:25 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   50C    P0   149W / 149W |   1726MiB / 11439MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     29512      C   /home/ubuntu/torch/install/bin/luajit       1715MiB |
+-----------------------------------------------------------------------------+

This makes me wonder whether CUDA Toolkit 8.0 GA2 (Feb 2017) added something new that Neural-Style doesn't require, which we could disable or remove to lower memory use.
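One knob that is on our side of the fence is cuDNN's algorithm selection. As a sketch, going from memory of the cudnn.torch bindings (not verified against every version), keeping both autotune globals off should bias cuDNN toward convolution algorithms with smaller workspaces; as far as I remember, neural_style.lua only turns autotuning on when -cudnn_autotune is passed:

```lua
-- Sketch based on the cudnn.torch bindings (not verified for every version):
-- both globals trade extra workspace memory for speed when set to true.
require 'cudnn'
cudnn.benchmark = false  -- don't autotune convolution algorithms
cudnn.fastest   = false  -- don't force the fastest algorithm regardless of workspace size
```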

ProGamerGov commented 7 years ago

Another setup for comparison:

Ubuntu 14.04.4 LTS (GNU/Linux 3.13.0-79-generic x86_64)

cuda-repo-ubuntu1404_7.5-18_amd64.deb, cudnn-7.0-linux-x64-v4.0-prod.tgz

Cuda 7.5, and cuDNN v4

Setup memory usage:

ubuntu@ip-Address:~/neural-style$ nvidia-smi
Wed Oct 25 23:34:49 2017
+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   46C    P0    70W / 149W |    154MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1564    C   /home/ubuntu/torch/install/bin/luajit           97MiB |
+-----------------------------------------------------------------------------+

Running th neural_style.lua -gpu 0 -backend cudnn:

ubuntu@ip-Address:~/neural-style$ nvidia-smi
Wed Oct 25 23:37:28 2017
+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   63C    P0   142W / 149W |   1375MiB / 11519MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1564    C   /home/ubuntu/torch/install/bin/luajit         1319MiB |
+-----------------------------------------------------------------------------+

It looks like memory usage increased significantly with the Cuda 8.0 GA2 (Feb 2017) update, while previous versions of Cuda were around 23.5% more memory-efficient. If the extra usage scales with the image size value, that would significantly lower the largest possible image size.

ProGamerGov commented 6 years ago

An interesting side effect of changing the Cuda, cuDNN, and Torch7 versions (and maybe even the Ubuntu version) is that the effect of the seed value seems to change.

So if you use -seed 876 with Cuda 8.0, that same seed value will not create the same output with Cuda 9.0.
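From what I remember of neural_style.lua (paraphrased, not a verbatim copy), the seed only controls the RNG that draws the initial random image, so any change in which convolution algorithms cuDNN picks (they vary by version and are not all deterministic) can still change the output:

```lua
-- Rough sketch of what -seed controls, paraphrased from memory of neural_style.lua.
if params.seed >= 0 then
  torch.manualSeed(params.seed)        -- CPU RNG, used for the random init image
  cutorch.manualSeedAll(params.seed)   -- GPU RNGs, for good measure
end
-- Even with a fixed seed, cuDNN may pick different (and non-deterministic)
-- convolution algorithms across versions, so outputs can still differ.
```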

ProGamerGov commented 6 years ago

Another way to slightly lower memory usage seems to be stripping layers from the VGG model:

https://github.com/jcjohnson/neural-style/issues/428#issuecomment-370185610
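The general idea (the linked comment has the actual procedure; the file names and cut-off index below are just placeholders) is to drop everything after the deepest layer the content/style losses actually use, since the fully connected layers alone hold most of VGG-19's parameters:

```lua
-- Illustrative sketch only; file names and the cut-off index are placeholders.
-- See the linked comment in #428 for the actual procedure.
require 'loadcaffe'
require 'nn'

local proto  = 'models/VGG_ILSVRC_19_layers_deploy.prototxt'
local weight = 'models/VGG_ILSVRC_19_layers.caffemodel'
local net = loadcaffe.load(proto, weight, 'nn')

-- Keep only the modules up to the deepest layer used for a loss (e.g. relu5_1);
-- the index 37 here is illustrative, not the exact position.
while #net.modules > 37 do
  net:remove()   -- nn.Sequential:remove() drops the last module
end

local p = net:getParameters()
print(('remaining parameters: %.1f M'):format(p:nElement() / 1e6))
torch.save('models/vgg19_truncated.t7', net)
```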

flaushi commented 6 years ago

Have you tried instance normalization with your setups? I'm running into problems with it...

ProGamerGov commented 6 years ago

@flaushi Are you referring to instance normalization from fast-neural-style?

flaushi commented 6 years ago

Yes!

ajhool commented 6 years ago

> It looks like memory usage increased significantly with the Cuda 8.0 GA2 (Feb 2017) update, while previous versions of Cuda were around 23.5% more memory-efficient. If the extra usage scales with the image size value, that would significantly lower the largest possible image size.

I'm confused, @ProGamerGov -- the GPU memory usage definitely scales with the image output size. Were these CUDA benchmarks taken with the same output size? One of my EC2 snapshots somehow has a corrupted CUDA install, so I'm deciding which CUDA version to go with -- I'd rather use the most memory-efficient one.

ProGamerGov commented 6 years ago

@ajhool I believe that I was using th neural_style.lua -gpu 0 -backend cudnn for all the tests, as in my first post in this thread. I currently have 3-4 different EC2 AMIs with CUDA installed, so I can check again if you like. But I do remember finding that parameter sets which worked with earlier versions of CUDA ran out of memory on the latest versions.

ajhool commented 6 years ago

That's really interesting. I sensed a significant speed increase on the newer versions of Cuda and cuDNN, but "sensed" is the operative word because I didn't do a controlled benchmark. I also sensed the memory usage increase. The latest Cuda supports the Volta GPUs, too, so I'm excited to see if there are significant rendering gains to be made there.

What I'm actually finding (I think) is that for smaller renders the main bottleneck occurs in some process that luajit is executing on the CPU. So, when I run multiple renders on the same GPU (~60% GPU memory usage), I am seeing significant (2x-3x) rendering slowdowns, and I am also seeing all 4 of the CPU cores get locked up at 100% due to a luajit process. I believe that I'd posted an issue about this in the past but might be misremembering. But concurrent rendering seems to be constrained by the CPU before the GPU, frustratingly.
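One thing I might try for the CPU contention (this is a guess at the cause, not a confirmed fix): Torch's CPU ops go through OpenMP, and each luajit process will happily grab every core by default. Pinning each concurrent render to fewer threads, either with torch.setnumthreads or by exporting OMP_NUM_THREADS before launching, might at least stop the renders from fighting each other:

```lua
-- Guess, not a confirmed fix: limit the OpenMP threads each render uses so
-- several concurrent luajit processes don't all contend for the same cores.
require 'torch'
print('threads before: ' .. torch.getnumthreads())
torch.setnumthreads(1)   -- one CPU thread per render
print('threads after:  ' .. torch.getnumthreads())
```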