jcjohnson / neural-style

Torch implementation of neural style algorithm
MIT License

Non-issue: relative style layers weights #237

Open htoyryla opened 8 years ago

htoyryla commented 8 years ago

I have for a long time been interested in the idea of adjusting the weights of style layers separately. For instance, if I see that some layer is producing far larger or smaller losses than is typical, it would be interesting to be able to adjust the weights.

I have now implemented this in https://github.com/htoyryla/neural-style by adding a new parameter -style_layer_weights, which accepts a comma-separated list of multipliers. The weight of each style layer will then be multiplier * style_weight. If the parameter is omitted, each layer will have multiplier = 1.
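
For reference, a minimal sketch of how such a flag can be parsed and combined with -style_weight; the names and details here are illustrative, not necessarily the fork's exact code:

  -- hypothetical sketch: parse "-style_layer_weights 0.5,1,2,..." into a table of
  -- per-layer multipliers; an omitted flag means multiplier 1 for every layer
  require 'torch'
  local cmd = torch.CmdLine()
  cmd:option('-style_layer_weights', '', 'comma-separated multipliers, one per style layer')

  local function parse_layer_weights(str, n_layers)
    local w = {}
    if str == '' then
      for i = 1, n_layers do w[i] = 1 end
    else
      for v in string.gmatch(str, '[^,]+') do table.insert(w, tonumber(v)) end
      assert(#w == n_layers, 'need one multiplier per style layer')
    end
    return w
  end

  -- effective weight of style layer i: w[i] * params.style_weight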

I assume that this is not for the average user, and that's why I didn't make this a pull request. But there might be other experiment-oriented users around, so I am letting them know that this exists.

StryxZilla commented 8 years ago

@htoyryla is there a good way of understanding what each style layer is filtering?

htoyryla commented 8 years ago

StryxZilla notifications@github.com wrote on 5.6.2016 at 18.23:

@htoyryla is there a good way of understanding what each style layer is filtering?

Not in exact terms, as far as I know. The very lowest layers see colors, then going upwards: lines, curves and shapes, until the upper layers see complex features.

This https://github.com/htoyryla/convis can be used to view what the filters of a layer see in a particular image.

bmaltais commented 8 years ago

@htoyryla I think you are onto something with the style_layer_weights. I did a quick test and even with a rough "balancing" I was able to bring the style loss of each layer to be about the same. The resulting image was better than without.

This made me think: what if, at each iteration, neural-style were to take the style_losses values, get the max value and then apply a weight correction for the next pass? This would essentially try to balance the next iteration's loss based on the observed loss of the previous iteration. In the end, the goal of style transfer is to transfer the style of the source image... but the current method leaves the individual layer style losses free to vary. I think the style loss of each layer should be about the same for the best style transfer.

I tried to implement this but my programming skills in Lua are way too bad. The best I can do is pseudo-code, if someone is brave enough to implement it:

local maxloss = 0
for _, loss_module in ipairs(style_losses) do
  maxloss = math.max(maxloss, loss_module.loss)
end
local layerweight = {}
for i, loss_module in ipairs(style_losses) do
  if maxloss == 0 then -- in case this is the 1st run and there is no known loss
    layerweight[i] = 1
  else
    layerweight[i] = maxloss / loss_module.loss
  end
end
htoyryla commented 8 years ago


I could try something like this (and a similar idea crossed my mind earlier) when I have time.

Intuitively, I feel that there is a danger of disrupting the balance between style and content, which, with fixed weights, naturally shifts during the iterations as at times content and at times style converges better. But this is simply something to keep in mind, to make sure that this automatic balancing of style weights does not change the overall weight of style vs content too much.

bmaltais commented 8 years ago

I was under the impression that driving the style loss this way would result in greater content loss for some layers, but also in a closer style match at the end... better final style overall. I guess this will only be known if you are able to implement this potential feature ;-)

Looking forward to testing it if you ever find the time to code it in.

htoyryla commented 8 years ago

bmaltais notifications@github.com wrote on 13.6.2016 at 21.07:

I was under the impression that driving the style loss would result in greater content loss for some layers but would result in a closer style match at the end... with greater loss of content details in some areas... but better style overall.

I guess there are several ways of balancing style weights across layers, and some ways would change the overall style-content balance while others wouldn’t. My preference is to keep style balancing separate from the style weight / content weight adjustments. Of course, automatic adjustment of style versus content weight might also be an interesting concept, but it is a different concept.

This is something to keep in mind already now when experimenting with style layer weights: do we get different results because we get better control from all layers, or because we have actually increased the overall style weight (for which we do not need layer-specific weights)?

Personally, I am not sure any single procedure will result in improvements with all materials. The optimization process in neural-style is so complex. Change something and the process will suddenly find another valley altogether. Yet, it is good to have settings to adjust when looking for better results.

But let’s see. This is interesting enough to try but just now I am busy with other things. Perhaps next week.

bmaltais commented 8 years ago

Looking at my pseudo-code, I think I had it backward: I was in fact accentuating instead of balancing the loss. The following code appears to be better suited:

local minloss = math.huge
for _, loss_module in ipairs(style_losses) do
  minloss = math.min(minloss, loss_module.loss)
end
local style_layer_weights = {}
for i, loss_module in ipairs(style_losses) do
  if minloss == 0 then -- in case this is the 1st run and there is no known loss
    table.insert(style_layer_weights, 1)
  else
    local style_layer_weights_val = minloss / loss_module.loss
    table.insert(style_layer_weights, style_layer_weights_val)
  end
end

This code will essentially reduce the weight of the high-loss layers, hence balancing each layer's loss against the layer with the least loss.

It will also be interesting to see the layer weight values throughout the iterations, to see whether they are stable or fluctuate a lot. My basic attempts with your current code base that supports style_layer_weights are producing really sharp results. It is almost like it takes the grain out of the output and produces an image that is much closer to the style's look. This is using NIN, by the way. I have not tried VGG19 yet.

htoyryla commented 8 years ago

I was anyway planning to think it through myself. I think that if the goal is to balance the style losses from different layers, it can (and should) be done inside feval(), as that is what is run in each iteration. And if the balance can be adjusted right away then actually there is no need for storing weights for the next iteration.

Maybe you could now try your version? Have a look at feval(). There you have access to style losses from each layer and you can adjust them before they are added to the total loss.

Just note that style_losses contains the loss modules (= code + data to handle loss), and in

for _, mod in ipairs(style_losses) do loss = loss + mod.loss end

it is mod.loss that contains the style loss value from the current iteration.

Hannu


htoyryla commented 8 years ago

bmaltais notifications@github.com wrote on 13.6.2016 at 22.42:

if minloss == 0 then -- in case this is the 1st run and there is no known loss

Furthermore, if I am not mistaken, already the first run produces valid loss values, as the loss is computed, roughly speaking, as the difference between the target and the current result.

It may still be necessary to have special handling for the first iteration, but it is not immediately obvious to me. If you need to know whether this is the first run, you can check num_calls.

bmaltais commented 8 years ago

Yeah, I am not very good with Lua, I must say ;-) This was just basic pseudo-code to express how I thought the problem could be approached, but I trust you can do a much better and more optimized one ;-)

I will publish a link to a comparison of no style layer weights (all 1) vs a tuned run, using the 1000th-iteration result as the adjustment values via the formula in the 2nd pseudo-code above.

(I am using your branched version of neural-style for that)

bmaltais commented 8 years ago

Here is the comparison: https://twitter.com/netputing/status/742449976429535232

Pretty radical if you ask me

Source images were: https://twitter.com/netputing/status/742450429850570752

bmaltais commented 8 years ago

Tweak was: -style_layer_weights .288,.255,.249,.009,.025,.093,.005,.043,1,.026

With 1st image loss being:

Iteration 1000 / 1000
  Content 1 loss: 0.003941
  Content 2 loss: 0.004568
  Content 3 loss: 0.003562
  Content 4 loss: 0.026721
  Content 5 loss: 0.014100
  Content 6 loss: 0.005582
  Content 7 loss: 0.019614
  Content 8 loss: 0.006781
  Content 9 loss: 0.001201
  Content 10 loss: 0.005899
  Style 1 loss: 2142.516710
  Style 2 loss: 2421.563864
  Style 3 loss: 2480.307221
  Style 4 loss: 70729.875565
  Style 5 loss: 24952.839315
  Style 6 loss: 6668.104231
  Style 7 loss: 119798.827171
  Style 8 loss: 14482.873678
  Style 9 loss: 617.072918
  Style 10 loss: 23917.636275
  Total loss: 268211.708918

vs tweaked loss being:

Iteration 1000 / 1000
  Content 1 loss: 0.003790
  Content 2 loss: 0.004413
  Content 3 loss: 0.003455
  Content 4 loss: 0.026003
  Content 5 loss: 0.013749
  Content 6 loss: 0.005467
  Content 7 loss: 0.019187
  Content 8 loss: 0.006695
  Content 9 loss: 0.001188
  Content 10 loss: 0.005787
  Style 1 loss: 141.991130
  Style 2 loss: 108.425413
  Style 3 loss: 84.199913
  Style 4 loss: 1128.063941
  Style 5 loss: 648.735538
  Style 6 loss: 457.198982
  Style 7 loss: 989.381075
  Style 8 loss: 584.940869
  Style 9 loss: 384.389726
  Style 10 loss: 583.861947
  Total loss: 5111.278266
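
For reference, these multipliers appear to be exactly the minloss/loss ratios from the first table above (Style 9, at 617.07, is the smallest loss). A throwaway Lua check, with the losses rounded from that table:

  local style = {2142.52, 2421.56, 2480.31, 70729.88, 24952.84,
                 6668.10, 119798.83, 14482.87, 617.07, 23917.64}
  local minloss = math.huge
  for _, l in ipairs(style) do minloss = math.min(minloss, l) end
  for _, l in ipairs(style) do io.write(string.format('%.3f ', minloss / l)) end
  print()
  -- prints: 0.288 0.255 0.249 0.009 0.025 0.093 0.005 0.043 1.000 0.026
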
htoyryla commented 8 years ago

bmaltais notifications@github.com wrote on 13.6.2016 at 23.10:

Yeah, I am not very good with Lua, I must say ;-) This was just basic pseudo-code to express how I thought the problem could be approached, but I trust you can do a much better and more optimized one ;-)

No, I frankly meant that you might try it yourself. Have a look at feval(), it is pretty compact. There you have all style loss values (mod.loss values in the loop) before they are added to the total loss. You just need to add your logic for adjusting them.

I think you need to adjust the loss values, not the weights, because the style modules were initialized with the weights when they were created, so the weights cannot be used dynamically as things work now.

But it should be rather simple to collect the values from each mod.loss, find minimum or maximum, make adjustments and then add them together.

htoyryla commented 8 years ago

One more thing: if and when the losses are adjusted in feval(), the printed losses per layer will be unaffected, so something more is still needed to print the adjusted losses in maybe_print().

bmaltais commented 8 years ago

Tried to change the feval() mod.loss value, but it does not change the resulting image... just the reported loss values. Too bad it is not possible to change the weights for every iteration. Maybe the best approach is to run two passes: one to get an estimate of the needed weights and a second one to generate the final result.

mystcat commented 8 years ago

Out of curiosity: what layers do you use? Originally there were just 5; you have 10.


bmaltais commented 8 years ago

I used @htoyryla's convis to look at the layers that reacted to the content image and selected all that actually reacted, leaving the other ones out:

-style_layers relu0,relu1,relu2,relu3,relu5,relu6,relu7,relu8,relu9,relu10

Tuning the layer weights actually makes a huge difference in the result. I was quite surprised.

mystcat commented 8 years ago

As I see from your tweaked settings, some layers are suppressed by a few orders of magnitude compared to others. You did this to lower the losses on those layers, but doing so seems to be roughly equivalent to excluding them. For example, layer 7 has a weight factor of 0.005. You adjusted it because it had a huge loss compared to the others. So, according to your intuition, if a layer has a bigger loss it should be suppressed. Basically it means you are trying to keep those style layers which are "closer" to the original image: if a style layer differs too much, you suppress it. Generally, the feature you are willing to add means it will try to choose the layers with minimal losses. I'm curious about the outcome as well.

mystcat commented 8 years ago

On the other hand, adjusting layer weights on the fly means that the contribution of each participating layer will be equalized, so the resulting image will be affected by the gradients from all participating layers equally. Nice idea actually.

bmaltais commented 8 years ago

@mystcat If you look at the twitter link I posted above you can see the result of tweaking the weights. I agree that it is almost like suppressing some of them... but there is still some influence left, and I think the neural network can still account for them in some ways. The resulting image is actually pretty good looking compared to the unweighted one.

Here are the actual parameters I used:

time th neural_style.lua \
-style_scale 1 -init image \
-style_image ../in/pic/src.jpg -content_image ../in/pic/dst4.jpg \
-output_image ../in/pic/outdst4d.jpg \
-image_size 1000 -content_weight 0.000001 -style_weight 100000 \
-save_iter 50 -num_iterations 1000 \
-model_file models/nin_imagenet_conv.caffemodel \
-proto_file models/train_val.prototxt \
-content_layers relu0,relu1,relu2,relu3,relu5,relu6,relu7,relu8,relu9,relu10 \
-style_layers relu0,relu1,relu2,relu3,relu5,relu6,relu7,relu8,relu9,relu10
htoyryla commented 8 years ago

bmaltais notifications@github.com wrote on 14.6.2016 at 1.34:

Tried to change the feval() mod.loss value, but it does not change the resulting image... just the reported loss values. Too bad it is not possible to change the weights for every iteration. Maybe the best approach is to run two passes: one to get an estimate of the needed weights and a second one to generate the final result.

mod.loss is the loss reported by the loss module, so changing it does indeed only change the printed value. What counts is what you add to the total loss.

But you are probably right that you indeed have to change the weights inside the loss module to have the correct effects, to get the gradients right. It should actually be possible to change the weights dynamically by changing mod.strength. Just note this strength, in my version, is the style_weight multiplied by the style_layer_weight.

htoyryla commented 8 years ago

mystcat notifications@github.com wrote on 14.6.2016 at 2.25:

On the other hand, adjusting layer weights on the fly means that the contribution of each participating layer will be equalized, so the resulting image will be affected by the gradients from all participating layers equally. Nice idea actually.

I tend to think that the differences between the loss values from different layers do not necessarily mean that one layer's output matches its target better than another's, but that the architecture of the model and similar reasons also affect the loss range of each layer. If that is correct, it is really equalizing we are talking about, not excluding some layers as it might seem at first.

But if the differences are due to architectural reasons, then the relative output ranges of each layer should not change so much from iteration to iteration, and adjusting the equalization at each iteration would not be needed. But anyway, bmaltais’ concept of doing the equalization automatically makes it easier to use, and in fact one might readjust the equalization less often than at every iteration.

mystcat commented 8 years ago

If I understand the loss function correctly, that is exactly what it does. The style loss computes the difference between the Gram matrices of content and style, so the lower the difference, the better the style match.


htoyryla commented 8 years ago

mystcat notifications@github.com wrote on 14.6.2016 at 6.44:

If I understand the loss function correctly, that is exactly what it does. The style loss computes the difference between the Gram matrices of content and style, so the lower the difference, the better the style match.

I have understood that for content layers, content image is fed into the model once for each content layer, and the output of the layer is stored as the target. The loss is then the "difference" between the current image and the target.

For style layers, it works similarly except that the Gram matrix is added on top of the layer, to give a representation of style instead of content. Feed the style image into the model and store the output of the layer's Gram matrix as the style target to be used in calculating the loss.

So the content and style are not compared, and the gram matrix is not used for content. Layer output is compared with the target obtained using the content image => content loss, and Gram matrix output is compared with the target obtained from the same layer's Gram matrix using the style image => style loss.

But then yes, the lower the loss, the closer to the target.
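
For reference, the comparison described above can be sketched roughly like this (a simplified illustration, not the repository's exact code; normalization details are omitted):

  require 'torch'
  require 'nn'

  -- the "style" of a layer activation is summarized by its Gram matrix
  local function gram(act)                 -- act: C x H x W activation tensor
    local C, HW = act:size(1), act:size(2) * act:size(3)
    local F = act:view(C, HW)              -- flatten the spatial dimensions
    return torch.mm(F, F:t())              -- C x C Gram matrix
  end

  local crit = nn.MSECriterion()
  -- content target: the layer output for the content image, captured once
  -- style target:   gram(layer output for the style image), captured once
  -- at each iteration, for the image being optimized:
  --   content_loss = strength_c * crit:forward(current_act, content_target)
  --   style_loss   = strength_s * crit:forward(gram(current_act), style_target)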

mystcat commented 8 years ago

I am sorry, but to my understanding this statement is far from the truth:

So the content and style are not compared, and the gram matrix is not used for content

This exact line calculates the Gram matrix for the current image (i.e. the content being processed): self.G = self.gram:forward(input)

And this line calculates the difference between the style Gram matrix and the current image's Gram matrix: local dG = self.crit:backward(self.G, self.target)

The criterion is standard MSE.

Style-layer Gram matrices are calculated during model initialization and do not change during the optimization process. Content Gram matrices are calculated at each iteration and then compared to the style Gram matrices, which are the target.

htoyryla commented 8 years ago


We are using the word content in different senses. When I say content, I mean the -content_image which is the target for content only.

When you say content, you talk about the output image which is being generated. Of course the STYLE of the image being generated is compared with the style image using Gram matrices.

Considering that the goal of neural-style is to combine the content of one image with the style of another, i.e. that content and style are two separable elements, it is confusing to use the term content also for the output image. This is why I misunderstood you. Gram matrices are for separating style from content, which is why it didn't make sense to me to use a Gram matrix for content.

mystcat commented 8 years ago

STYLE of the image being generated

The image being generated is based on the content by definition. The optimization process "bends" or "pushes" the content image to match the style image. It uses the Gram matrices of both to calculate the difference. Yes, it doesn't calculate Gram matrices of the original content image (except for the first iteration), only of the one being optimized.

There are only the content and style images as algorithm input (aside from parameters). So the result is the content pushed towards the style. In that sense the image being optimized is the content. We do not change the style.

htoyryla commented 8 years ago

I guess you are thinking in terms of using -init image. Then the original content image is gradually modified, yes.

The default is starting from random noise, which is pushed towards both style and content targets.

Anyway, if you read the original paper, it is about content and style as elements of an image which can be separated and combined. In the context of neural-style and related methods, the concepts of content and style as the constituent elements of every image are essential. The core idea behind neural-style is that content loss can be measured from a conv layer and style loss can be measured from a Gram matrix appended on top of a conv layer. In this sense gram matrix is indeed not used for content, as its function is to extract a representation of style. Only confusion will result if "content" is used ambiguously.

It will be difficult to continue a discussion if you insist on another terminology. I guess there is no point in arguing about this any longer.


htoyryla commented 8 years ago

@bmaltais , I think this implements your idea inside feval():

  local num_calls = 0
  local function feval(x)
    num_calls = num_calls + 1
    net:forward(x)
    local grad = net:updateGradInput(x, dy)
    local loss = 0
    for _, mod in ipairs(content_losses) do
      loss = loss + mod.loss
    end
    local slosses = torch.Tensor(#style_losses)
    for i, mod in ipairs(style_losses) do
      loss = loss + mod.loss
      slosses[i] = mod.loss
    end
    local minloss = torch.min(slosses)
    for i, mod in ipairs(style_losses) do
      mod.strength = (minloss / mod.loss) * params.style_weight  
    end
    maybe_print(num_calls, loss)
    maybe_save(num_calls)

    collectgarbage()
    -- optim.lbfgs expects a vector for gradients
    return loss, grad:view(grad:nElement())
  end

params.style_weight is used here

      mod.strength = (minloss / mod.loss) * params.style_weight  

because minloss/mod.loss will be 0..1 and mod.strength should be = style_weight * layer_weight.

This runs but is not quite stable

Iteration 80 / 1000 
  Content 1 loss: 130318.955078 
  Style 1 loss: 5.019568    
  Style 2 loss: 336.770164  
  Style 3 loss: 2149.705239 
  Style 4 loss: 528.574245  
  Style 5 loss: 2611.889076 
  Total loss: 135950.913371 
Iteration 81 / 1000 
  Content 1 loss: 128694.267578 
  Style 1 loss: 27554.019165    
  Style 2 loss: 4160.545138 
  Style 3 loss: 352.721948  
  Style 4 loss: 97156.932369    
  Style 5 loss: 5.016697    
  Total loss: 257923.502894 

Using

      mod.strength = (minloss / mod.loss) * mod.strength  

keeps style losses amazingly well, perhaps too well, equalized:

Iteration 7 / 1000  
  Content 1 loss: 2091152.343750    
  Style 1 loss: 656.860093  
  Style 2 loss: 656.860093  
  Style 3 loss: 656.860093  
  Style 4 loss: 656.860010  
  Style 5 loss: 656.859997  
  Total loss: 2094436.644035    
Iteration 8 / 1000  
  Content 1 loss: 2091147.031250    
  Style 1 loss: 656.859997  
  Style 2 loss: 656.859997  
  Style 3 loss: 656.860050  
  Style 4 loss: 656.859956  
  Style 5 loss: 656.859997  
  Total loss: 2094431.331247    

Let's see what kind of images this produces. I still think that it might be better to equalize only at every n-th iteration.

PS. Sorry about closing the issue. Clicked on a wrong button.

PPS. By the way, this modification is not dependent on my style_layer_weights, as far as I can see.

htoyryla commented 8 years ago

Equalizing at every n-th iteration is simple:

    if (num_calls % 50 == 1) then
      local minloss = torch.min(slosses)
      for i, mod in ipairs(style_losses) do
        mod.strength = (minloss / mod.loss) * mod.strength  
      end
    end

What worries me is that the style losses stay constant, not only equalized but really almost constant. It would seem to me that this is not good for the gradients. I'll have to see how this works. It is possible that I have a bug somewhere.

PS. It looks like the losses remaining almost unchanged happened only during the very early iterations. They do change later. The question then is how often the equalization should be done.

htoyryla commented 8 years ago

I tested how this reacts to increasing style_weight. The losses do increase correspondingly, but one gets "function value changing less than tolX" very easily. Probably further experimentation is needed to get the feedback loop working without side-effects.

I will now move back to other things. If you want to play with this, have fun.

bmaltais commented 8 years ago

Wow, a lot of work last night on this. I did a quick test and this is a whole new territory to explore. Applying the balancing at every iteration has a major effect. Even applying it only at every nth iteration is significant. It will take some more exploration to find out whether it is actually useful as an option.

I will try to implement a gradual weight parameter such that the balancing effect will be much weaker at the 1st iteration and 100% at the last iteration. Perhaps this will help with the overall result.

Something like:

mod.strength = (((minloss / mod.loss) * num_calls / params.num_iterations) + (1 - (1 * num_calls / params.num_iterations))) * mod.strength

htoyryla commented 8 years ago

bmaltais notifications@github.com wrote on 14.6.2016 at 15.28:

I will try to implement a gradual weight parameter such that the balancing effect will be much weaker at the 1st iteration and 100% at the last iteration. Perhaps this will help with the overall result.

Something like:

mod.strength = (minloss / mod.loss) * mod.strength * num_calls / params.num_iterations

Sounds interesting but be careful. Mod.strength is the total weight of the style for this layer, so this line would cancel any effect of style initially. To introduce equalization gradually, you need a calculation that will leave mod.strength unchanged when num_calls is small.
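
For what it's worth, the longer blended formula in bmaltais' comment above (the one with the added (1 - num_calls/num_iterations) term) already has that property: it interpolates between leaving mod.strength untouched in the early iterations and applying full equalization at the last one. Spelled out as a sketch meant to sit inside feval(), using the same variable names as the earlier snippets:

  -- illustrative only: t ramps from ~0 (first iteration) to 1 (last iteration),
  -- so the factor ramps from 1 (no change) to minloss/mod.loss (full equalization)
  local t = num_calls / params.num_iterations
  for _, mod in ipairs(style_losses) do
    local eq = minloss / mod.loss
    mod.strength = (t * eq + (1 - t)) * mod.strength
  end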

mystcat commented 8 years ago

@htoyryla

if you read the original paper

Yes I did. As many researchers proved later, the authors were wrong about at least two things:

So we are speaking about "-init image" here by definition; common sense after millions of generated images. Please don't appeal to the original paper if you are not following it thoroughly.

The ability to admit one's own mistakes is a quality of strong minds. It is indeed easier to just close the discussion if you are running out of arguments.

I agree it isn't worth the time to continue this argument. I just wanted to point out that some claims were false. I didn't want to prove anything to you. It isn't worth the time.

Thank you for trying the proposed idea anyway.

htoyryla commented 8 years ago

@mystcat, you are defining "content" differently from the sense of separating content and style. It is my mistake that I did not realize that before my comments.

My description of the process is still valid. The core idea behind neural-style is that content loss can be measured from a conv layer and style loss can be measured from a Gram matrix appended on top of a conv layer. In this sense gram matrix is indeed not used for content, as its function is to extract a representation of style. There is nothing wrong in what I wrote, writing about content and style as two elements in any image. You understood content as an image to which style is being transferred. Had I realized that before my comment, I wouldn't have written it.

If you still don't realize this, but continue to call my claim "false", then we indeed have no basis for a discussion. Content in the sense opposed to style is indeed not measured by a Gram matrix. The style of an image, which of course can also be called content, is of course measured by a Gram matrix.

Whether or not -init random or -init image works better is a side issue. I still prefer the more unambiguous usage of content and style as two elements in any image, and refer to the images by more specific terms, such as content image, style image and output image. This is simply because it is safer, not tied to any particular method or implementation.

I am unhappy with the turn this thread has taken. I am not a professional researcher in this field, just have a long varied background in IT, telecommunications and software industry and currently an interest in artistic applications for my own use. I usually prefer to participate in communities where my contribution is respected, otherwise I am wasting my time and effort.

mystcat commented 8 years ago

Thank you for the explanation. It makes sense. Indeed we were operating with different terminology.

Your contribution to this project is very valuable. Thank you for all your input. I appreciate it.

You have helped me understand the subject better; I have learned from your experiments.

htoyryla commented 8 years ago

I'll need to be more alert concerning terminological issues. I used to be quite good at spotting them when I was working together with people in real life. The Internet is much more difficult. I'm sorry for this.

Concerning -init image, I have noticed that people prefer it now, but I still use -init random almost always. Perhaps it is my style; I am not after photorealistic results, but more like art prints. I usually combine several images, perhaps add something by hand, and print them with an ink jet on fine art paper, where the choice of paper also significantly affects the look of the final image.

mystcat commented 8 years ago

By the way, it is an interesting outcome.

Would you like to figure out together what is style and what is content? What is the nature of "separation of content from style"?

According to the authors, they discard the spatial part of the style image and keep only the "style" part, which is very subjective. It is how we perceive an image regardless of spatial information. The authors proposed a very interesting method for extracting such style (the perceptual part) using Gram matrices.

On the other hand, we have the source image (which I called content) from which we want to keep only the spatial information. But that is not completely true: we want to keep the style of the original image to some extent. For example, an image is perceived better if lips are reddish (assuming such colors exist in the style image). That is exactly what the algorithm does; it usually maps similar colors from the style onto the same colors in the content. Thus it does keep the style of the content image, not only the spatial part.

htoyryla commented 8 years ago

Now it IS getting interesting. I admit that my understanding of "style" in this context is defined by what the neural-style method, through the use of Gram matrices, achieves. But it is style only in a qualified sense. Definitely not what the art world would call the "style of an artist". I remember someone calling the neural-style style "color and texture".

Then you point out that also the style of the source image is to an extent copied to the output. Yes, I was thinking of it today... these stylistic features are present in the conv layer output which steers content loss. If they weren't, the Gram matrix wouldn't be able to extract them.

As to the process choosing the correct colors for features and objects... couldn't that derive from two sources? Either directly from the low-level features of the source image, or from the high-level features, when the model "knows" that there are lips here and lips are red. Or perhaps both elements contribute together.

I would think that the choice of content layers affects to which extent and in which form the style of the source image is copied. Conv1_x is good for the colors, conv2_x for lines etc. Conv4_x and conv5_x know about more complex features.

These are just random thoughts, no claims made. I have recently been experimenting with neural-doodle, the forward branch, which uses patches instead of Gram matrices for style. It is quite different, sometimes quite good, even in cases when the result does not closely copy the style. I have understood that the patches are claimed to be better in that they can retain some of the spatial structure of the style.

You might also be interested in these experiments of mine http://liipetti.net/erratic/2016/03/28/controlling-image-content-with-fc-layers/ (see also the two sequels) where I modified neural-style so that even the spatial structure of the source image was lost, the output being generated merely on the basis of the labels from the FC-layers. In the sequel I then gradually re-inserted the spatial element to the content.

mystcat commented 8 years ago

Thank you Hannu,

You have definitely experimented more with the subject than I have; I can't add more. The field of view of a neuron on conv4_1 is 68 pixels, and 156 pixels on conv5_1, according to my calculations. So it is quite possible that there is a neuron for lips and other common high-level features. Nevertheless, color matching works even if I exclude layers higher than conv3_x. My intuition is that those higher layers are not much concerned with facial features. I concluded this from pictures generated from random init using solely particular layers with the content part suppressed (a very high ratio between style and content). It produced patchy patterns without any recognizable elements aside from very tiny strokes and patterns. Here is an example from a conv4 layer: https://scontent-ord1-1.cdninstagram.com/t51.2885-15/s750x750/sh0.08/e35/12940324_600455020123341_563928147_n.jpg The style image was Picasso, Studio with Plaster Head.

htoyryla commented 8 years ago

It may well be that the default models know more about birds etc. than faces, depending on the material they have been trained on. I was also surprised once at how poorly the so-called VGG_FACE performed with portraits. Perhaps it had not really learned to generalize?

Have you tried my https://github.com/htoyryla/convis ? It should be straightforward to feed in a face and have a look at what it highlights on a layer in a model.

PS. A feature highlight from VGG19 conv5_1 which comes closest to lips, although eyes are included as well. Most of the other maps make no immediate sense to me with regard to facial features.

saara2014b-conv5_1-62

Here's the mouth, or so it looks like, from conv5_2.

saara2014b-conv5_2-161

mystcat commented 8 years ago

Which particular neuron (n=?) from conv5_1 is that? I am playing with convis right now and have to say VGG19 is not as good at feature extraction as one may imagine. It does the task of object recognition very well but is not good for extracting particular meaningful features (like facial features). This one (attached) is channel 281 of relu5_1: image2-relu5_1-281

Neuron activations on the upper layers of VGG19 seem mostly chaotic to me. There are some with a seemingly understandable purpose, but most are not.

htoyryla commented 8 years ago

The eyes & mouth image is channel 62 (counted from 1) of conv5_1. With my image, 281 looks a bit different: most of the face is highlighted, but not all, and not as brightly.

I guess it is possible to classify objects quite well even if the features detected on conv layers are not meaningful as we see them.

I am a bit busy with other things now, will probably have more time next week.

mystcat commented 8 years ago

and not as brightly

I am using relu layer, not conv.

Neuron activations on the upper layers of VGG19 seem mostly chaotic to me. There are some with a seemingly understandable purpose, but most are not.

Some interesting calculations for conv5_1 here: the layer size in pixels is only 14x14 for a 224x224 image; the field of view of each neuron is 156 pixels; each neuron steps (strides) 16 pixels when it convolves (if we translate it back to the source image).

So each neuron on the conv5_1 layer covers more than half of the source image it was trained on. It is really hard to say what excites it in the source image without tracing back all gradients. I'm afraid the straightforward approach you used in convis is not enough to understand what is going on.

This project could be quite resourceful if you want to dig into this further: https://github.com/yosinski/deep-visualization-toolbox
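
For anyone who wants to double-check the 68 / 156 / 16-pixel figures quoted in this thread, they can be reproduced by chaining receptive-field bookkeeping through VGG19's 3x3 convolutions (stride 1) and 2x2 poolings (stride 2); a small standalone Lua sketch, independent of neural-style:

  -- each layer updates (rf, jump): rf is the receptive field in input pixels,
  -- jump is the input-pixel distance between neighbouring neurons of that layer
  local function step(rf, jump, k, s) return rf + (k - 1) * jump, jump * s end

  local rf, jump = 1, 1
  local layers = {
    {3,1},{3,1},{2,2},                 -- conv1_1, conv1_2, pool1
    {3,1},{3,1},{2,2},                 -- conv2_1, conv2_2, pool2
    {3,1},{3,1},{3,1},{3,1},{2,2},     -- conv3_1..conv3_4, pool3
    {3,1},{3,1},{3,1},{3,1},{2,2},     -- conv4_1..conv4_4, pool4 (rf is 68 at conv4_1)
    {3,1},                             -- conv5_1
  }
  for _, l in ipairs(layers) do rf, jump = step(rf, jump, l[1], l[2]) end
  print(rf, jump)                      -- 156  16, and 224/16 gives the 14x14 grid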

mystcat commented 8 years ago

Here is my 62nd of relu5_1 image2-relu5_1-62

Looks like it is not quite lips+mouth specific.

mystcat commented 8 years ago

Here is what I get on relu5_1/62 if I use image 2 times bigger (448):

image2-relu5_1-62

Something eyes+mouth specific emerged -)

Surprisingly, it activated on the whole face of the guy in the background.

htoyryla commented 8 years ago

mystcat notifications@github.com wrote on 14.6.2016 at 21.21:

Some interesting calculations for conv5_1 here: the layer size in pixels is only 14x14 for a 224x224 image; the field of view of each neuron is 156 pixels; each neuron steps (strides) 16 pixels when it convolves (if we translate it back to the source image). So each neuron on the conv5_1 layer covers more than half of the source image it was trained on. It is really hard to say what excites it in the source image without tracing back all gradients. I'm afraid the straightforward approach you used in convis is not enough to understand what is going on.

I have had similar doubts; you expressed it more clearly.

This project could be quite resourceful if you want to dig into this further: https://github.com/yosinski/deep-visualization-toolbox

I think this is on my waiting list. So much to do.

htoyryla commented 8 years ago

On 14.06.2016 21:31, mystcat wrote:

Here is what I get on relu5_1/62 if I use image 2 times bigger (448):

The image I used was still a bit larger, 662x681.

mystcat commented 8 years ago

relu5_1-62 896x896 - still eyes+mouth specific

image2-relu5_1-62

bmaltais commented 8 years ago

Hi guys. I have been playing with this all day and here is my final function:

  local function feval(x)
    num_calls = num_calls + 1
    net:forward(x)
    local grad = net:updateGradInput(x, dy)
    local loss = 0
    for _, mod in ipairs(content_losses) do
      loss = loss + mod.loss
    end
    local slosses = torch.Tensor(#style_losses)
    for i, mod in ipairs(style_losses) do
      loss = loss + mod.loss
      slosses[i] = mod.loss
    end

    local maxloss = torch.max(slosses)

    -- give the highest-loss layer the full style_weight, scale the others
    -- down in proportion, and never let a strength drop below 1
    for i, mod in ipairs(style_losses) do
      mod.strength = params.style_weight * mod.loss / maxloss
      if (mod.strength < 1) then
        mod.strength = 1
      end
    end

    maybe_print(num_calls, loss)
    maybe_save(num_calls)

    collectgarbage()
    -- optim.lbfgs expects a vector for gradients
    return loss, grad:view(grad:nElement())
  end

It may look strange at first to seek the maximum loss, but believe it or not it is what gives the best results. Trying to raise the weight of the layers with minimum loss actually results in really bad-looking images.

The resulting images are not very different from the unmodified version... but it is interesting to see how each layer gets used at each iteration.

Here is an example result: https://twitter.com/netputing/status/742892268592336897 Using: https://twitter.com/netputing/status/742893635348242432

The 2nd image carries more of the original style than the 1st, but the difference is very subtle.

Code used to generate images (same for both):

th neural_style.lua -tv_weight 0.0001 -init image -style_image ../in/leo/src.jpg -content_image ../in/leo/dsto.jpg -output_image ../in/leo/res3-o.jpg -image_size 1200 -content_weight 0 -style_weight 100000 -save_iter 50 -num_iterations 1000 -model_file models/nin_imagenet_conv.caffemodel -proto_file models/train_val.prototxt -content_layers relu0,relu1,relu3,relu5,relu7,relu9 -style_layers relu0,relu1,relu3,relu5,relu7,relu9