jcjohnson / neural-style

Torch implementation of neural style algorithm
MIT License

FP16 version of neural style #425

Closed · michaelhuang74 closed this issue 6 years ago

michaelhuang74 commented 6 years ago

I am trying to run neural style in half precision, i.e., 16-bit floating point. I modified the code and posted it at https://github.com/michaelhuang74/FP16-Neural-Style/blob/master/neural_style_FP16.lua

(1) The above version doesn't work with the L-BFGS optimizer. (2) When I use the Adam optimizer, it does run on a Titan Xp. FP16 performance on the Titan Xp is supposed to be 1/64 of FP32 performance, but the current implementation is only a bit slower than the FP32 version. In addition, it has a NaN (not a number) issue. The output follows.

th neural_style.lua -style_image style/vangogh.jpg -content_image inputimage/man_face.jpg -style_weight 20 -content_weight 1 -image_size 800 -style_scale 1 -style_layers relu1_2,relu2_2,relu3_2,relu4_2,relu5_2 -output_image outputimage/man_face.vangogh.sw20.cw1.sl12-22-32-42-52.c800s800.1000it.adam.cudnn.half.jpg -num_iterations 1000 -save_iter 100 -backend cudnn -cudnn_autotune -optimizer adam

Running optimization with ADAM
Iteration 100 / 1000
  Content 1 loss: nan
  Style 1 loss: nan
  Style 2 loss: nan
  Style 3 loss: nan
  Style 4 loss: nan
  Style 5 loss: nan
  Total loss: nan
Iteration 200 / 1000
  Content 1 loss: nan
  Style 1 loss: nan
  Style 2 loss: nan
  Style 3 loss: nan
  Style 4 loss: nan
  Style 5 loss: nan
  Total loss: nan

I changed Adam's epsilon from the default 1e-8 to 1e-4 or 1e-2, but I still get NaN outputs.
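
For reference, this is roughly where I made that change, assuming the stock optim.adam (which reads epsilon from its config table, defaulting to 1e-8) and the adam branch in neural_style.lua:

-- sketch: pass epsilon through the state table handed to optim.adam
optim_state = {
  learningRate = params.learning_rate,
  epsilon = 1e-4,   -- tried 1e-4 and 1e-2; the NaNs persist either way
}
-- later in the optimization loop:
-- local x, losses = optim.adam(feval, img, optim_state)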

Any idea how to modify the code to make the FP16 work?

htoyryla commented 6 years ago

I tried your code and I am getting similar results (CUDA 8.0 on a GTX 1070). I tried some modifications (like using :cudaHalf() to convert existing variables) but without significant results.

However, L-BFGS also has the NaN issue. The first content and style loss evaluation gives

Capturing style target 1
Running optimization with L-BFGS
Iteration 1 / 500
  Content 1 loss: 171200.000000
  Style 1 loss: nan
  Total loss: nan

before the cublas error occurs.

Using -init image results in a content loss of 0, which makes sense. But somehow the style loss evaluation does not work.

Using Adam, neither the content nor the style loss evaluation works.

I don't know much about the inner workings of Torch, and even less about the CUDA implementation or about debugging when using CUDA. My next approach would be to use simpler code first to try out how a cudaHalf model and variables work, step by step: first only forwarding an image through a model, then trying to evaluate style with a Gram matrix, before even introducing the optimizer. A sketch of such a first step follows.
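
Something like the following (the model paths are the defaults from neural_style.lua; whether every layer actually has FP16 kernels depends on the cunn/cudnn build, so this is only a sketch):

require 'torch'
require 'nn'
require 'cutorch'
require 'cunn'
require 'loadcaffe'

-- step 1: only forward an image through an FP16 model, no loss modules, no optimizer
local cnn = loadcaffe.load('models/VGG_ILSVRC_19_layers_deploy.prototxt',
                           'models/VGG_ILSVRC_19_layers.caffemodel', 'nn'):cudaHalf()
local img = torch.rand(3, 224, 224):cudaHalf()
local out = cnn:forward(img)

-- check whether the FP16 activations already contain nan/inf
local o = out:float()
print(o:min(), o:max())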

An alternative approach would be to try PyTorch. They have a tutorial on how to implement neural-style (BTW, the clearest introduction to the style transfer process I have seen): http://pytorch.org/tutorials/advanced/neural_style_tutorial.html

htoyryla commented 6 years ago

Checking the captured style targets with a print statement here (best to use only one style_layer for this):

-- Set all loss modules to loss mode
  for i = 1, #content_losses do
    content_losses[i].mode = 'loss'
  end
  for i = 1, #style_losses do
    style_losses[i].mode = 'loss'
    print(style_losses[i].target)   -- inspect the captured style (Gram) target for NaNs
  end

shows that the style target is already full of NaNs before the optimizer is started, so it is no wonder that we end up with NaNs when calculating losses against such a target. So something is already wrong in the style loss calculation, before any optimization happens.

It appears to me that the NaNs arise in the Gram matrix calculation. Here, self.dbg stores an input that looks correct, but self.G is full of NaNs:

function StyleLoss:updateOutput(input)
  self.dbg = input                               -- keep a copy of the input for debugging
  self.G = self.gram:forward(input):cudaHalf()   -- this Gram matrix comes out full of NaNs

However, when I write a small test, the Gram module works without producing NaNs (even if I scale the random input down to very small values):

require 'torch'
require 'nn'
require 'cutorch'
require 'cunn'
require 'cudnn'

local Gram, parent = torch.class('nn.GramMatrix', 'nn.Module')

function Gram:__init()
  parent.__init(self)
end

function Gram:updateOutput(input)
  assert(input:dim() == 3)
  local C, H, W = input:size(1), input:size(2), input:size(3)
  local x_flat = input:view(C, H * W):cudaHalf()
  self.output:resize(C, C):cudaHalf()
  self.output:mm(x_flat, x_flat:t())
  return self.output
end

function Gram:updateGradInput(input, gradOutput)
  assert(input:dim() == 3 and input:size(1))
  local C, H, W = input:size(1), input:size(2), input:size(3)
  local x_flat = input:view(C, H * W)
  self.gradInput:resize(C, H * W):mm(gradOutput, x_flat)
  self.gradInput:addmm(gradOutput:t(), x_flat)
  self.gradInput = self.gradInput:view(C, H, W)
  return self.gradInput
end

-- random FP16 input: the Gram module works fine here, no NaNs
input = torch.rand(4,32,32):cudaHalf()
gram = nn.GramMatrix():cudaHalf()
output = gram:forward(input)
print(output)

htoyryla commented 6 years ago

It appears that I can reproduce the error with the above test program when, instead of a random input to the Gram matrix, I use the actual input that causes the error (saved to a t7 file from the modified neural_style). So the input to a style loss module causes the Gram matrix to contain mostly infs when using CudaHalfTensors; the same input works fine with CudaTensors.
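
For reference, a minimal sketch of how the failing activation can be captured and replayed in the test above (the file name styleloss_input.t7 is just a placeholder):

-- inside StyleLoss:updateOutput in the modified neural_style, dump the input:
torch.save('styleloss_input.t7', input:float())

-- in the standalone GramMatrix test, replace the random input with the saved one:
input = torch.load('styleloss_input.t7'):cudaHalf()
output = gram:forward(input)
print(output)   -- mostly inf with CudaHalfTensors; fine if both module and input use :cuda()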

htoyryla commented 6 years ago

Comment removed... could not reproduce the results.

OK... I can get the Gram matrix to work by using FP32 (cuda) inside the GramMatrix module and converting the input and output back and forth. Content and style losses now have reasonable numeric values. The cublas error still persists, and Adam is not converging and produces a grey image; on the other hand, I may have messed up how the gradient is calculated now.
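
A minimal sketch of what that workaround looks like (a simplified illustration of the idea, not my exact code):

-- do the matrix multiply in FP32 and hand an FP16 result back to the rest of the net
function Gram:updateOutput(input)
  assert(input:dim() == 3)
  local C, H, W = input:size(1), input:size(2), input:size(3)
  local x_flat = input:view(C, H * W):cuda()   -- FP16 -> FP32
  local G = torch.CudaTensor(C, C)
  G:mm(x_flat, x_flat:t())                     -- accumulate in FP32
  self.output = G:cudaHalf()                   -- FP32 -> FP16
  return self.output
end

Note that large Gram values can still overflow when cast back to FP16 unless they are scaled down (e.g. divided by the number of elements, as StyleLoss does) while still in FP32.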

And I am not saying that this kind of approach would make any sense in practice.

Looking further into what can go wrong with CudaHalfTensor, it looks like the mm operation can overflow even if the input tensors have reasonable value ranges (like 0...42): FP16 can only represent values up to about 65504, and the Gram matrix accumulates H*W products of such activations. Perhaps normalizing the values somewhere would help. Or even using the normalized version of VGG (no... this does not help).
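
For instance, a tiny repro of the overflow (assuming cutorch was built with FP16 support and cublas runs a true half gemm on this GPU):

require 'cutorch'

-- 4096 products of 42*42 sum to about 7.2e6, far beyond the FP16 maximum of roughly 65504
local a = torch.CudaHalfTensor(1, 4096):fill(42)
local g = torch.CudaHalfTensor(1, 1)
g:mm(a, a:t())
print(g)   -- expected to show inf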

michaelhuang74 commented 6 years ago

@htoyryla Could you share your code that is able to calculate the content and style losses? Thanks for spending your time debugging this issue.

htoyryla commented 6 years ago

@michaelhuang74 here's the code which manages to calculate the losses, with the downside that FP32 is used within the Gram matrix calculation. The program is still not working correctly otherwise: L-BFGS fails with a cublas-related error, and Adam runs but the losses stay the same and the image does not change. I guess this is because the Gram matrix gradient is still computed in FP16, where mm and addmm probably suffer from the same problem as in the forward calculation. I have not yet succeeded in implementing a similar workaround there; it looks like I can calculate a reasonable-looking gradient, but the next iteration gets a NaN input from adam.

https://gist.github.com/htoyryla/233a9d0857440d2a8bafe732ddeba325
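
For what it's worth, the kind of FP32 workaround I have been trying on the backward side looks roughly like this (my sketch of the idea, not the code in the gist):

-- compute the Gram gradient in FP32 as well, then cast back to FP16
function Gram:updateGradInput(input, gradOutput)
  assert(input:dim() == 3 and input:size(1))
  local C, H, W = input:size(1), input:size(2), input:size(3)
  local x_flat = input:view(C, H * W):cuda()       -- FP16 -> FP32
  local go = gradOutput:cuda()
  local grad = torch.CudaTensor(C, H * W)
  grad:mm(go, x_flat)
  grad:addmm(go:t(), x_flat)
  self.gradInput = grad:view(C, H, W):cudaHalf()   -- back to FP16 for the model
  return self.gradInput
end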

michaelhuang74 commented 6 years ago

@htoyryla I updated the code at https://github.com/michaelhuang74/FP16-Neural-Style/blob/master/neural_style_FP16.lua

I have moved the Gram matrix part to 32-bit. When I use VGG for neural style, the MSECriterion:updateGradInput method in 16-bit mode generates 'inf' and 'nan' values in the second iteration. Therefore, I tried to move MSECriterion to 32-bit, but the code then generated the following error: lua/5.1/nn/THNN.lua:110: bad argument #4 to 'v' (cannot convert 'struct THCudaHalfTensor ' to 'struct THCudaTensor ')

Then I tried to create my own version of the MSECriterion, i.e. SELF_MSECriterion. It generated the same error. My command is as follows.

th neural_style_half.lua -style_image style/vangogh.jpg -content_image inputimage/man_face.jpg -style_weight 20 -content_weight 1 -image_size 100 -style_scale 1 -style_layers relu4_2 -output_image outputimage/man_face.vangogh.sw20.cw1.sl42.c100s100.3it.adam.cudnn.half.vgg.jpg -num_iterations 3 -save_iter 1 -print_iter 1 -backend cudnn -cudnn_autotune -optimizer adam

The complete error message is as follows.

/home/mqhuang/torch/install/bin/luajit: /home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:67:
In 25 module of nn.Sequential:
/home/mqhuang/torch/install/share/lua/5.1/nn/THNN.lua:110: bad argument #4 to 'v' (cannot convert 'struct THCudaHalfTensor ' to 'struct THCudaTensor ')
stack traceback:
  [C]: in function 'v'
  /home/mqhuang/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'MSECriterion_updateGradInput'
  neural_style_half.lua:682: in function 'backward'
  neural_style_half.lua:599: in function
  [C]: in function 'xpcall'
  /home/mqhuang/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
  /home/mqhuang/torch/install/share/lua/5.1/nn/Sequential.lua:55: in function 'updateGradInput'
  neural_style_half.lua:294: in function 'opfunc'
  /home/mqhuang/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
  neural_style_half.lua:326: in function 'main'
  neural_style_half.lua:694: in main chunk
  [C]: in function 'dofile'
  ...uang/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
  [C]: at 0x00406670
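
The error seems to come from feeding FP16 tensors into a criterion whose THNN backend expects CudaTensor arguments, so the tensors probably have to be converted explicitly at the loss-module boundary. A rough sketch of the direction I am trying, loosely following the ContentLoss module in neural_style.lua (self.crit is assumed to be an nn.MSECriterion():cuda() instance; this is only a sketch, not final working code):

-- keep the criterion in FP32 while the surrounding model runs in FP16
function ContentLoss:updateOutput(input)
  if self.mode == 'capture' then
    self.target = input:cuda()   -- input is FP16, so :cuda() makes an FP32 copy
  elseif self.mode == 'loss' then
    self.loss = self.crit:forward(input:cuda(), self.target) * self.strength
  end
  self.output = input
  return self.output
end

function ContentLoss:updateGradInput(input, gradOutput)
  if self.mode == 'loss' then
    -- gradient computed entirely in FP32, then cast back to FP16
    self.gradInput = self.crit:backward(input:cuda(), self.target):cudaHalf()
    self.gradInput:mul(self.strength)
    self.gradInput:add(gradOutput)
  else
    self.gradInput = gradOutput
  end
  return self.gradInput
end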

michaelhuang74 commented 6 years ago

I have updated my code with a functional FP16 implementation.

htoyryla commented 6 years ago

I see you had to implement both loss modules using FP32 while the actual model uses FP16, which actually makes sense. Do you have any figures on how much memory this saves?

PS. I tried your code myself and got an error (using updated torch, nn, cutorch, cunn, cudnn):

home/hannu/torch/install/bin/luajit: /home/hannu/torch/install/share/lua/5.1/optim/lbfgs.lua:183: cublas runtime error : unknown error at /tmp/luarocks_cutorch-scm-1-8481/cutorch/lib/THC/THCBlas.cu:67

michaelhuang74 commented 6 years ago

@htoyryla There is a bug with the L-BFGS optimizer in FP16 mode. You need to use the Adam optimizer in FP16 mode.

I haven't taken a close look at how much memory can be saved by using FP16. I will collect some figures on memory savings later.

htoyryla commented 6 years ago

@michaelhuang74 Thanks, now it works.

I wonder, though: it runs far too fast on my GTX 1070, which is supposed to have crippled FP16 performance. Does it really use FP16? nvidia-smi shows a memory usage of 1158MiB with neural_style_FP16.lua and 1058MiB with the original.

michaelhuang74 commented 6 years ago

@htoyryla I have had a similar experience on the Titan Xp and the Nvidia Tesla V100 using Torch. The FP16 and FP32 implementations have similar speed and almost the same memory usage.

Not sure what's going on. My guess is that the same 32-bit CUDA cores are used for both FP16 and FP32 on the Pascal architecture, which would explain why the FP16 and FP32 implementations run at the same speed.

On the Volta architecture, there are the newly designed Tensor Cores, but Torch does not use them for FP16. I am testing an FP16 version in PyTorch, which supports the new Tensor Cores.

Coding in PyTorch (see https://github.com/leongatys/PytorchNeuralStyleTransfer for neural style) is more flexible, so I may start using PyTorch for neural networks in the future.

htoyryla commented 6 years ago

Pytorch, yes. I have already moved to it for much of my work.

ProGamerGov commented 6 years ago

While looking for something else, I found some information that I thought might help with FP16 with regard to CUDA: https://github.com/torch/cutorch/blob/master/lib/THC/CMakeLists.txt#L218-L225

Basically, if the CUDA version is less than 7.5, cutorch is not compiled with FP16 support.