ProGamerGov opened this issue 7 years ago
Each feature map of the model sees the image differently. One sees horizontal lines, another vertical lines, others see diagonal lines, circles, boxes, windows, eyes and so on. A layer may consist of as many as 512 feature maps, which respond to different features. Combining them does not sound like a good idea to me, just like I wouldn't put 512 photos of London on top of each other to show what London is like.
My main idea in making convis was to be able to check, when training a model, how the training is succeeding. One can also use it to gain some understanding of what a model sees. But the feature maps respond to thousands of different features, and I don't see how one could compress that into a heat map in a meaningful way. To understand the layers, one would have to feed the model different kinds of images and then examine all the feature maps to determine exactly which features each feature map is responding to. But convis is probably too simplistic for that kind of work.
I recently noticed that MIT's Places 365 models were used to generate saliency maps: http://cnnlocalization.csail.mit.edu/
That is exactly what I was trying to do here with convis. I wonder if we can apply class activation mapping (CAM) to other models or if it's specific to the Places 365 project?
I found an implementation of CAM that works on the regular caffemodels that Neural-Style uses: https://github.com/ramprs/grad-cam
Though that implementation only supports a single layer at a time. It would be interesting to see how the heatmap changes between iterations in Neural-Style.
So classification.lua contains the code, along with: utils.lua
Specifically these two functions are used to create the heatmap:
https://github.com/ramprs/grad-cam/blob/master/misc/utils.lua#L84-L128
https://github.com/ramprs/grad-cam/blob/master/misc/utils.lua#L154-L176
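For reference, the core of the heatmap computation in those two functions boils down to roughly the following (a Torch sketch, not the repo's exact code), where activations is the CxHxW output of the chosen layer and gradients is the CxHxW gradient of the class score with respect to that output:
-- weight each channel by its pooled gradients, then take a weighted sum
-- of the activation maps and keep only the positive part (ReLU)
local C = activations:size(1)
local weights = torch.sum(gradients:view(C, -1), 2)            -- C x 1 (sum or mean; they differ only by a constant factor)
local cam = torch.sum(torch.cmul(activations,
    weights:view(C, 1, 1):expandAs(activations)), 1)           -- 1 x H x W
cam:cmul(torch.gt(cam, 0):typeAs(cam))                         -- keep positive evidence only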
Edit:
@htoyryla I can't seem to figure out how to get the code working in Neural-Style. I've been trying to place it all in the feval(x) function. Maybe it needs to be implemented like a loss function to work correctly?
This might work?
self.loss = self.crit:forward(input, self.target)                    -- MSE between the layer output and the target
self.activations = input:squeeze()                                   -- C x H x W activations of this layer
self.gradInput = self.crit:backward(input, self.target)
self.gradients = self.gradInput:squeeze()                            -- gradients w.r.t. the activations
self.weights = torch.sum(self.gradients:view(self.activations:size(1), -1), 2)
self.map = torch.sum(torch.cmul(self.activations, self.weights:view(self.activations:size(1), 1, 1):expandAs(self.activations)), 1)
self.map = self.map:cmul(torch.gt(self.map, 0):typeAs(self.map))
Maybe something like the TV Loss function:
local HeatmapLoss, parent = torch.class('nn.HeatmapLoss', 'nn.Module')
-- Heatmap CAM
function HeatmapLoss:__init()
  parent.__init(self)
  self.loss = 0
  self.gradients = nil
  self.activations = nil
  self.weights = nil
  self.map = nil
  self.crit = nn.MSECriterion()
  self.target = torch.Tensor()
end

function HeatmapLoss:updateOutput(input)
  self.output = input
  return self.output
end

function HeatmapLoss:updateGradInput(input, gradOutput)
  self.gradInput:resizeAs(gradOutput):copy(gradOutput)
  if input:nElement() == self.target:nElement() then
    self.loss = self.crit:forward(input, self.target)
    self.activations = input:squeeze()
    self.gradients = self.crit:backward(input, self.target):squeeze()
    self.weights = torch.sum(self.gradients:view(self.activations:size(1), -1), 2)
    self.map = torch.sum(torch.cmul(self.activations, self.weights:view(self.activations:size(1), 1, 1):expandAs(self.activations)), 1)
    self.map = self.map:cmul(torch.gt(self.map, 0):typeAs(self.map))
  end
  -- the heatmap is stored in self.map; pass the incoming gradient through unchanged
  return self.gradInput
end
It seems to me that you are trying to achieve two things: a) get a heatmap that somehow combines all the feature map activations from a given layer, and b) monitor how this heatmap changes during the iterations.
I don't think there is any major difficulty doing this in neural-style; one simply needs to find a good way to combine, say, 128 feature maps into a single heatmap, such as taking the average or the maximum over all feature maps from a layer.
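For example, a minimal sketch of the combining step, assuming fmaps is the CxHxW output of the chosen layer:
local avg_map = torch.mean(fmaps, 1)   -- 1 x H x W: per-pixel average over all channels
local max_map = torch.max(fmaps, 1)    -- 1 x H x W: per-pixel maximum over all channels
-- either map can then be normalized to 0..255 and saved as a grayscale image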
I did something related to this in one of our earlier threads. It does not display feature maps, though, but the gradients at each iteration, as if to indicate which part of the image is now changing and how much.
This gist contains a modified neural-style that displays the total gradient, converted to an image by normalizing it to the range 0..255. It just gives a rough visual indication of how and where the image is changing during the iteration process. https://gist.github.com/htoyryla/445e2649293f702a940c58a8a3cef472
Convis was kept simple, just mapping the activations. The more sophisticated visualization methods attempt to follow the gradients to indicate exactly which areas in the image caused those activations. I guess your difficulties arise from the need to make neural-style both run the usual iterations and trace the gradients for visualization. Good luck.
I made a quick test modifying convis to save a single combined activation map from a layer. It is easy to do; it is only an open question how meaningful such a map is, as the different channels respond to different features, so it is quite natural that the combined activations from a layer cover most of the image.
Another thought: you cannot do this inside feval, because the output of each layer is not available there. However, one could calculate the activation map inside each style (and content) loss module and then collect the stored results inside feval. When using simple activations as in convis, no additional loss modules are needed. And I don't have time or interest to start looking into the gradient-following approach.
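A rough sketch of that idea, assuming the StyleLoss module from neural_style.lua (the activation_map field is added here just for illustration):
-- inside StyleLoss:updateOutput(input), where input is the C x H x W output of the attached layer:
self.activation_map = torch.sum(input, 1):squeeze()   -- H x W combined activation map
-- (the existing Gram matrix / loss computation continues unchanged)

-- inside feval(x), after net:forward(x), the maps can be collected:
for i, mod in ipairs(style_losses) do
  local map = mod.activation_map   -- normalize to 0..255 and save, e.g. as in convis
end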
OK, I can see how the simple combined activation maps could be useful.
Relu1_1
Relu2_1
Relu3_1
Relu5_3
Just as a sidetrack... my convis shows me the activations from each individual filter in a VGG network. I noticed that the second filter of relu1_1 of the usual VGG19 reacts mainly to the sky in the default Tübingen image. So by taking the output from that feature map and adding some postprocessing I can get masks like these. The point here is that the filters in relu1_1 act directly on the image and therefore can also be used as ordinary image filters (if they happen to produce useful output, that is).
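A rough sketch of that kind of postprocessing, assuming net has been built up to relu1_1 as in convis and img is the preprocessed image (the channel index and threshold are only examples):
local fmaps = net:forward(img)             -- C x H x W activations from relu1_1
local ch = fmaps[2]:clone()                -- output of the second filter, H x W
ch:div(ch:max() + 1e-8)                    -- normalize to 0..1
local mask = torch.gt(ch, 0.5):double()    -- 1 where the filter responds strongly, 0 elsewhere
image.save('sky_mask.png', mask)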
Could you please share the convis modifications for creating simple combined activation maps?
I am also wondering how torch.max can be used to get a predicted class value when no classification list is provided. The code lines here seem to do this, and I can't seem to recreate it in convis or Neural-Style.
Like for example:
local y = net:forward(img)
local score, pred_label = torch.max(y,1)
label = pred_label[1]
print("Predicted label: ", pred_label)
This seems like it might be interesting to use for models that don't have readily available category lists.
Another idea I just had: what if, instead of arbitrarily restricting Neural-Style's layer channels to specific values, we instead restricted them to those that match the most likely label? Is it possible to get a list of layers and their filters using the above code?
Edit:
I figured it out; it was really simple:
local width, height = 224, 224
content_image = image.scale(content_image, width, height)
local img = preprocess(content_image):float()   -- preprocess the image as neural_style.lua does
local cnn = loadcaffe.load(params.proto, params.model, "nn"):float()
local y = cnn:forward(img)
local score, pred_label = torch.max(y, 1)
print("Predicted label: ", pred_label)
I also see that your neural_mirage5.lua does not resize the image before making the predictions? Is the above method basically the same as yours, except that it uses torch.max to get the label with the highest predicted probability, whereas your code checks every label and creates a top-5 set of labels?
@htoyryla For the mask image you created with relu1_1, I guess that particular filter was looking for a "sky texture"?
And in the context of our conversations here, "filter" and "layer channel", are the same thing, right?
For the mask image you created with relu1_1, I guess that particular filter was looking for a "sky texture"?
Not really, the lowest levels cannot detect complex entities like "sky", they simply act as basic convolutional filters. It could be that it detects a certain color.
And in the context of our conversations here, "filter" and "layer channel", are the same thing, right?
Yes. Functionally they are filters. In neural-style/torch terms, a channel in a layer.
Is the above method basically the same as yours, except that it uses torch.max to get the label with the highest predicted probability, whereas your code checks every label and creates a top-5 set of labels?
Yes. It shows the top 5 labels. In addition, when neural-mirage creates a new image, the target is the complete set of classification probabilities, not only the single class with the highest probability. It tries to create an image that gives the same mix of label probabilities.
Note also that neural-mirage modifies the model (adding an adaptive pooling layer between the conv and FC layers) so that the FC layers can be used with images of varying size. Therefore no resize is needed before prediction.
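Roughly, that kind of modification could look like this (a sketch only; view_idx, the 7x7 grid and the flattening layer are assumptions about a standard loadcaffe-loaded VGG, not neural-mirage's actual code):
-- rebuild the loadcaffe-loaded VGG, replacing the fixed-size flattening before the
-- FC layers (at some index view_idx) with adaptive pooling to a 7x7 grid, so the
-- FC layers always see 512*7*7 inputs regardless of the input image size
local model = nn.Sequential()
for i = 1, #cnn do
  if i == view_idx then
    model:add(nn.SpatialAdaptiveMaxPooling(7, 7))
    model:add(nn.View(512 * 7 * 7))
  else
    model:add(cnn:get(i))
  end
end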
Another idea I just had: what if, instead of arbitrarily restricting Neural-Style's layer channels to specific values, we instead restricted them to those that match the most likely label? Is it possible to get a list of layers and their filters using the above code?
No. One has to look at each filter in each layer to see which activations are essential. Perhaps one could follow the gradients from the classification downward to see which filters contribute more and which less; that's not something I am familiar with. Anyway, even the lower activations may be significant, and dropping those filters may change the results.
This is the (quick & dirty) code I used to make an average activation map from a given layer. I simply take the CxHxW output from the layer and sum over the channels, which gives an HxW map, then normalize it to the 0...255 value range for display.
If I remember correctly, the modified part starts at the line local fmaps = net:forward(img)
require 'torch'
require 'nn'
require 'image'
require 'loadcaffe'
function preprocess(img)
  local mean_pixel = torch.DoubleTensor({103.939, 116.779, 123.68})
  local perm = torch.LongTensor{3, 2, 1}
  img = img:index(1, perm):mul(256.0)
  mean_pixel = mean_pixel:view(3, 1, 1):expandAs(img)
  img:add(-1, mean_pixel)
  return img
end
function deprocess(img)
  local mean_pixel = torch.DoubleTensor({103.939, 116.779, 123.68})
  mean_pixel = mean_pixel:view(3, 1, 1):expandAs(img)
  img = img + mean_pixel
  local perm = torch.LongTensor{3, 2, 1}
  img = img:index(1, perm):div(256.0)
  return img
end
local cmd = torch.CmdLine()
cmd:option('-image', 'examples/inputs/tubingen.jpg')
cmd:option('-output_dir', 'convis', 'directory where to place images')
cmd:option('-image_size', 800, 'output image size')
cmd:option('-proto', 'models/VGG_ILSVRC_19_layers_deploy.prototxt')
cmd:option('-model', 'models/VGG_ILSVRC_19_layers.caffemodel')
cmd:option('-layer', 'relu4_2', 'layer to examine')
local params = cmd:parse(arg)
local content_image = image.load(params.image, 3)
content_image = image.scale(content_image, params.image_size, 'bilinear')
local content_image_caffe = preprocess(content_image):float()
local img = content_image_caffe:clone():float()
local cnn = loadcaffe.load(params.proto, params.model, "nn"):float()
local net = nn.Sequential()
for i = 1, #cnn do
  local layer = cnn:get(i)
  local typ = torch.type(layer)
  local name = layer.name
  print(name, typ)
  net:add(layer)
  if (name == params.layer) then break end
  if (i == #cnn) then
    print("No such layer: "..params.layer)
    return
  end
end
local fmaps = net:forward(img)
local n = fmaps:size(1)
local filename = "c9out.png" --params.output_dir .. "/" .. string.sub(params.image:match("[^/]+$"), 1, -5) .. "-" .. params.layer
local y = torch.sum(fmaps, 1)
local m = y:max()
y = y:mul(255):div(m)
local y3 = torch.Tensor(3,y:size(2),y:size(3))
local y1 = y[1]
y3[1] = y1
y3[2] = y1
y3[3] = y1
local disp = deprocess(y3:double())
disp = image.minmax{tensor=disp, min=0, max=1}
disp = image.scale(disp, content_image:size(3), content_image:size(2))
image.save(filename, disp)
print("saving image ",filename)
In this comment here, I noted that the FCN-32s PASCAL model creates grey rectangle artifacts.
Image size 512:
Image size 1536:
I used a modified version of your convis.lua: https://gist.github.com/ProGamerGov/8f0560d8aea77c8c39c4d694b711e123
Then I just averaged all the layer outputs together with:
convert <layer1.png layer2.png layer3.png layer4.png> -average average_layers.png
Do you think that this has something to do with the artifacts? None of the other models I tested have anything like this, and the angles match the artifact's angles.
You mean the added frame around the image. I think that comes from the 100-pixel padding used in the model; see https://github.com/shelhamer/fcn.berkeleyvision.org/blob/master/voc-fcn32s/val.prototxt#L27
I think there are ways to modify the model to remove the padding, though I haven't done exactly this kind of operation. It is probably easier to try modifying the style loss modules to remove the padding before calculating the Gram matrix. I almost started trying this, but it was not so straightforward either: one has to adjust to how the size of the feature maps changes in different layers.
In fact it is quite easy to remove the padding. Load the model into th, take the first layer and set padH and padW to zero (for instance). But one cannot save into a caffemodel from torch. I guess there are tools in caffe to do this, but I haven't used them.
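A minimal sketch of doing that interactively in th (the file paths are placeholders, and one should check that the first layer is really the one carrying the 100-pixel padding):
require 'loadcaffe'
-- placeholder paths for the FCN-32s prototxt and caffemodel
local cnn = loadcaffe.load('fcn32s_deploy.prototxt', 'fcn32s.caffemodel', 'nn')
local conv1 = cnn:get(1)          -- first convolution layer
print(conv1.padW, conv1.padH)     -- should show the 100-pixel padding
conv1.padW = 0
conv1.padH = 0
-- the modified network can be used directly in this session, but not saved back to a caffemodel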
But one can do this at runtime like this:
if next_content_idx <= #content_layers or next_style_idx <= #style_layers then
  local layer = cnn:get(i)
  local name = layer.name
  local layer_type = torch.type(layer)
  local is_pooling = (layer_type == 'cudnn.SpatialMaxPooling' or layer_type == 'nn.SpatialMaxPooling')
  -- remove extra padding in fcn32s model
  -- (note: the layer loop in neural_style.lua starts at i = 1, so this condition
  --  may need to be i == 1 to actually reach the first convolution layer)
  if i == 0 then
    layer.padH = 0
    layer.padW = 0
  end
  if is_pooling and params.pooling == 'avg' then
@htoyryla I'm just curious whether the padding is somehow the cause of the artifacts I'm experiencing. If it is, then I wonder what other parts of a model may cause artifacts. If parts of other models do cause artifacts, then maybe they can be removed by editing Neural-Style, or the model itself.
Also, do you have any idea where I should start if I want to record information about the individual filters and their activations so that I can generate a list of usable layer channels?
Could I use your convis tool to generate all the images for each filter/channel, and repeat that on multiple images of a specific category? Then I could run some sort of analysis on those channel/filter images for light and dark pixels. Would this be a viable idea? I imagine that more bright pixels equals better/stronger activations for each filter?
Just modify neural-style to remove the padding by adding these lines
if i == 0 then
  layer.padH = 0
  layer.padW = 0
end
and see if it makes a difference.
Also, do you have any idea where I should start if I want to record information about the individual filters and their activations so that I can generate a list of usable layer channels?
You seem to be asking for a simple way to do something which is quite complex.
Yet, on second thought, the most relevant channels are probably those with the strongest activations for the relevant images (both content and style). We could feed in an image, calculate some statistics for each channel, and then list the channels with the strongest activations.
A mere average would be too crude: it dismisses channels with strong activations within a smaller area. But it could be a way to start. Or one could take the maximum. One can then try to find a better formula to measure the activations; perhaps something like the number of pixels with activation above a threshold?
Let's see if I can try this approach; it seems interesting. In fact, if one can define a criterion for dropping a channel, based on low activations from the style image, one can do it automatically: just give a threshold and the style loss calculation will ignore channels which do not respond well enough to the style image.
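A rough sketch of that kind of per-channel statistic, assuming fmaps is the CxHxW output of a style layer for the style image (this is only to illustrate the idea, not the code in the gist below; the 1e-1 threshold is arbitrary):
local C = fmaps:size(1)
local stats = {}
for c = 1, C do
  local ch = fmaps[c]
  stats[c] = {
    channel = c,
    norm    = ch:norm(),                 -- overall activation strength
    max     = ch:max(),                  -- strongest single response
    area    = torch.ge(ch, 1e-1):sum(),  -- number of pixels above the threshold
  }
end
-- sort by norm and print the strongest channels
table.sort(stats, function(a, b) return a.norm > b.norm end)
for i = 1, math.min(10, C) do
  print(stats[i].channel, stats[i].norm, stats[i].max, stats[i].area)
end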
Try this https://gist.github.com/htoyryla/49cb3ab0864d2a12f558631c7b3d87a3
Give it a layer and an image (for use with neural-channels this should probably be the style image) and you get a list of channels which might be the most suitable ones. The nc parameter specifies how many are listed.
My neural-channels.lua is not the best way to make use of this anymore, as in practice it works only with a single style layer (because you cannot make channel selections per layer). It would probably be best to include this "channel pruning" in neural-style itself, so that when the style target is captured, each style loss module evaluates which are the best channels and then uses only those when calculating the loss. Seems quite straightforward.
Here's hopefully a working version that tests the model during style capture and selects, per style layer, the nc channels with the strongest activations (as measured using torch.norm of the channel output). These channels are then favored during the iterations, similar to how the earlier neural-channels worked. The rest of the channels are not ignored totally (as this would stop the iterations from working) but are given a lower weight.
Remember that when decreasing nc, you need to increase style_weight yourself to keep the same content-style balance.
https://gist.github.com/htoyryla/b7940d31d329ee6ffb67b3185f414b8e
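The selection step itself only takes a few lines during style capture; a sketch, assuming input is the CxHxW layer output for the style image and params.nc is the number of channels to keep (the gist's actual code may differ in details):
-- per-channel activation strength for the style image
local C = input:size(1)
local norms = torch.Tensor(C)
for c = 1, C do
  norms[c] = input[c]:norm()
end
-- indices of the nc strongest channels
local _, order = torch.sort(norms, 1, true)   -- descending
local best = {}
for i = 1, math.min(params.nc, C) do
  best[i] = order[i]
end
-- best is then passed to inputMask() to build the per-channel weight mask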
I'm noticing that the loss values with the 10 best channels for each layer with neural_bestchannels.lua resemble the loss values you see in the later stages of multiscale resolution.
I had a theory that channels/filters with strong activations result in a high degree of stylization, while channels/filters with weak activations result in a low degree of stylization.
I first noticed this clearly in my Protobuf-Dreamer project (I had suspicions about it from neural-channels.lua). For example, you can see that different channels of the mixed5a_1x1 layer have different intensities of activations: https://i.imgur.com/icJjqm9.png
The difference is especially apparent between channel 106 (left) and channel 184 (right), where this was the input image:
While the inception5h model used in Protobuf-Dreamer uses the Inception architecture and not the VGG architecture that Neural-Style uses, I suspect that the two are similar in regard to these high and low activation channels. Playing around with neural-channels.lua, it looked like I could influence the degree of stylization by only changing the channel values. While testing my fine-tuned models, I noticed what appeared to be a similar effect:
What's interesting here is that the degree of stylization is less with one style image, and more with another style image. The parameters never changed, but the channels/filters in the model did. I think this also backs up my theory.
Because different channels have different activation strengths, I wonder what would happen if, instead of giving the strongest channels a higher weighting, we instead tried to make every channel equal to every other channel regardless of activation intensity. For example, we could give the weakest channels higher weights relative to the strongest channels.
For convis, I noticed that the Illustration2vec model's activations resemble the model's "style". Compared to other models, the Illustration2vec model transfers styles with a very distinct anime style of its own.
It seems to "see" every input image in an anime style. This is most apparent on input images with faces (especially the eyes).
I wonder how well placing an emphasis on the best content layer channels, in addition to the best style layer channels, would work? How would placing an emphasis on only the content layers compare to placing an emphasis on only the style layers?
I think I got neural_bestchannels.lua to do the same thing with the content layer(s): https://gist.github.com/ProGamerGov/ef79cc3d47f6647f8f5a1582a657ce3d
I tried using:
if i == 0 then
  layer.padH = 0
  layer.padW = 0
end
And it did not stop the artifacts from the FCN-32s PASCAL model.
These are the results from my experiments with bestchannels.lua, style channels only, and the default channel weighting: https://imgur.com/a/yxnZm
This is the result from using style and content channels, in addition to the default channel weighting:
These are the results from using style and content channels, and custom channel weighting values: https://imgur.com/a/LVekL
And this was the control test: https://i.imgur.com/kFUEZK0.png
Of course, the same parameters were used for each test, and only the -nc parameter, along with the channel weighting value, was changed.
The style image was examples/inputs/starry_night_google.jpg and the content image was examples/inputs/hoovertowernight.jpg for all of the above experiments.
Changing the channel weighting was done by changing the value in this line of code: local m = torch.Tensor(C,H,W):fill(0.2), from the inputMask function.
These results are certainly interesting, but I am having a hard time quantifying the differences in a meaningful way that makes sense based on the chosen parameters. Things will probably become more clear as I experiment with other style images, content images, and models.
Changing the channel weighting was done by changing the value in this line of code: local m = torch.Tensor(C,H,W):fill(0.2), from the inputMask function.
This line defines the default weight of the channels. The weight of the selected channels is set here:
m[sch] = 5
Ideally, I think, one would set the default weight to zero. When I was testing the original neural_channels, however, the iterations failed if the default weight was zero. The matrix became too sparse, I guess. But then I was testing with a single channel. With nc=10 I guess the default weight could be much smaller, like your experiment shows.
Remember also that tampering with the channel weights changes the effective style weight, and so does changing nc. This makes testing a bit uncertain.
Because different channels have different activation strengths, I wonder what would happen if, instead of giving the strongest channels a higher weighting, we instead tried to make every channel equal to every other channel regardless of activation intensity. For example, we could give the weakest channels higher weights relative to the strongest channels.
This could be an interesting experiment, but the results could be quite erratic: we would be emphasising features NOT found in the images!
I think I got neural_bestchannels.lua to do the same thing with the content layer(s): https://gist.github.com/ProGamerGov/ef79cc3d47f6647f8f5a1582a657ce3d
I had to add the mode captureS to make sure that the styleLoss module captures the best channels from the style image, not from the content image. I think in the contentLoss module this danger does not exist.
But interesting idea... ignoring all but the strongest content features.
Ouch... there is a bug in neural_bestchannels.lua, so that no channels actually get emphasis. The only thing that happens is that the style weight is decreased.
https://gist.github.com/htoyryla/b7940d31d329ee6ffb67b3185f414b8e#file-neural_bestchannels-lua-L530
This line should be
if channels ~= nil then
I noticed this when I tested decreasing the default channel weight to 1e-2 and then increasing the emphasis channel weight, with no effect on the losses. After the correction, changing nc from 4 to 10 has a dramatic effect on the losses (the same effect as increasing the style weight).
Yeah, I was looking over that part of the code earlier and wondering if it was indeed a bug. It looks like I did fix it myself, but that fix wasn't actually in the script I used to create the above examples... I couldn't find the style_channels variable anywhere else in the code, but for whatever reason I assumed that there was something else going on that I was missing as I tried to follow the code (I guess I was more tired than I realized).
One observation: this method of using mainly nc channels per layer now appears to favor the relu1_x layers, which now have the highest loss values, while previously I think relu3_x was the strongest.
This is probably because relu1_x has the fewest channels, so dropping most channels has a smaller effect than on higher layers. But it might be good to test without the relu1 layers as well.
I couldn't find the style_channels variable anywhere else in the code,
In neural_channels, style_channels contained the channels given in the parameter style_channels, which has now been replaced by the automatically detected best channels. I had simply overlooked this if statement.
Because different channels have different activation strengths, I wonder what would happen if, instead of giving the strongest channels a higher weighting, we instead tried to make every channel equal to every other channel regardless of activation intensity. For example, we could give the weakest channels higher weights relative to the strongest channels.
One might calculate the average norm over the channels, and then populate the channel mask with multipliers: (average norm / channel norm). This would in effect make all channels equally strong. It should be easy to implement, although the effect could be strange: we would be favoring features not present in the style image.
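A minimal sketch of building such a mask, assuming input is the CxHxW layer output for the style image (illustrative only, not the code in the gist below):
local C, H, W = input:size(1), input:size(2), input:size(3)
local norms = torch.Tensor(C)
for c = 1, C do
  norms[c] = input[c]:norm()
end
local avg = norms:mean()
local m = torch.Tensor(C, H, W)
for c = 1, C do
  -- weight each channel by (average norm / channel norm), guarding against division by zero
  m[c]:fill(avg / math.max(norms[c], 1e-8))
end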
Meanwhile I made a simpler test by just adjusting the code for inputMask as follows:
function inputMask(C, H, W, channels)
  local t = torch.Tensor(C,H,W):fill(1)
  local m = torch.Tensor(C,H,W):fill(2)
  if channels ~= nil then
    for i = 1, #channels do
      local sch = channels[i]
      --print(i, sch)
      if sch > C then
        print("skipping non-existent channel ", sch)
      else
        m[sch] = 0.2
      end
    end
  end
  return t:cmul(m):cuda()
end
we can suppress a few of the strongest channels while still keeping close to the original style. For instance, I like this result using the defaults but suppressing the 8 strongest channels: simple, not too much detail.
This https://gist.github.com/htoyryla/072e1f0475eebc9a4dfc0c011498da9c implements weighting each channel by (average norm of channels / norm of this channel).
It does nothing dramatic, as far as I can see. It does not (as I may have thought) bring out features which are not in the style; I was thinking wrongly there: the process is still moving towards the style target. But what this may do is make finding the target more difficult, as the channels which contribute most to this style are attenuated. The weights affect how the steering wheel works, and we modify the weights to favor turns away from the target?
Which makes me think: could we make the search faster by doing the reverse, amplifying the already strong channels? I guess not... like when you increase the learning rate, you are likely to miss the target. Which again can be compared to turning the steering wheel too much each time.
I'm getting NaNs from neural-equalchannels.lua.
One observation: this method of using mainly nc channels per layer now appears to favor the relu1_x layers, which now have the highest loss values, while previously I think relu3_x was the strongest.
This is probably because relu1_x has the fewest channels, so dropping most channels has a smaller effect than on higher layers. But it might be good to test without the relu1 layers as well.
Right now we can only use up to the number of channels in the lowest layer. But maybe we could counteract some of this favoritism by treating a layer normally when the requested number of channels is larger than what the layer has.
Using the modified bestchannels.lua that supports both content and style layer channels, the results seem to make more sense now:
The weighting works correctly now as well it seems:
The control test:
Using different amounts of channels with the -nc parameter results in a focus on different aspects of the style image and content image. Adding or subtracting even a single value from the -nc parameter can result in a dramatic change in the resulting output image. For example, the control image has vivid spirals, while -nc 50 created some really nice wave-like patterns instead. I also think that the style "flows" better in relation to the content image with -nc 50, compared to the control image and other -nc values.
I meanwhile have come to like the approach of suppressing the strongest channels more. It produces a simpler, less detailed image (it kind of follows the large forms of the style better).
Suppressing strong channels:
Suppressing weaker channels
Right now we can only use up to the number of channels in the lowest layer. But maybe we could counteract some of this favoritism by treating a layer normally when the requested number of channels is larger than what the layer has.
I don't immediately see how that would work, but never mind; you are free to try it. I was thinking rather that, so that the effect on the effective style weight would be the same in all layers, one would suppress a given proportion of the channels. E.g. nc would be given on a 1..64 scale, and then multiplied by C/64 for each layer.
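A one-line sketch of that scaling, assuming C is the channel count of the current layer and params.nc is given on the 1..64 scale of relu1_1 (hypothetical names):
-- scale the requested channel count in proportion to this layer's width
local nc_layer = math.max(1, math.floor(params.nc * C / 64))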
I'll have a look at neural-equalchannels in a moment.
I downloaded neural-equalchannels from the gist (under a different name) and gave the command
th neural-equalchannel-gist.lua -print_iter 1 -backend cudnn
and it iterates nicely. Using adam works too. But it may well be that it will not work with all models or in all cases. After all, equalizing the activations from all channels is quite an extreme idea.
Suppressing the stronger channels creates a result that looks a bit more like fast style transfer, especially in that last example you posted. I've only been messing around with giving emphasis to the strongest channels. How well do the values work in your suppression code that was shared in a comment above?
Personally, I feel I never got fast-neural-style to give anything close to my styles. I am often after styles that are not too detailed, even towards the abstract; neural-style is not so good at it, and fast-neural-style was much worse. Now suppressing the stronger channels looks promising.
Here's my inputMask() for suppressing the strongest channels. I usually set nc = 1 ... 10, at times 24 or 32. These values were intended for low values of nc; that's why I changed 5 to 2 for suppressing nc channels... so as not to upset the style-content balance too much.
function inputMask(C, H, W, channels)
  local t = torch.Tensor(C,H,W):fill(1)
  local m = torch.Tensor(C,H,W):fill(2)
  if channels ~= nil then
    for i = 1, #channels do
      local sch = channels[i]
      --print(i, sch)
      if sch > C then
        print("skipping non-existent channel ", sch)
      else
        m[sch] = 0.2
      end
    end
  end
  return t:cmul(m):cuda()
end
Suppressing the weaker channels creates a result that looks a bit more like fast style transfer
Just to make sure... "suppressing weaker channels" is neural-bestchannels.lua as it is now in gist. The reverse approach would be suppressing stronger channels, with inputMask as in the comment above.
Just to make sure... "suppressing weaker channels" is neural-bestchannels.lua as it is now in gist. The reverse approach would be suppressing stronger channels, with inputMask as in the comment above.
I meant suppressing stronger channels.
I meant suppressing stronger channels.
The last example with suppressing the stronger channels is with the lowest style weight. I think that makes it similar to fast-neural-style (with which one gets mainly color and texture effects while the shapes are not much affected... at least that's my impression of it).
I guess I did not try suppressing weak channels with as low a style weight at all, so ignore that example for comparison purposes.
Here are the equalized content and style layer channel results:
I'm not sure what to say about the equalization results, but they are certainly different than all the previous tests.
And here's what happened when I suppressed the top 50 strongest channels on each layer for both the content and style layers:
I find it interesting that suppressing the top 50 strongest channels helped the moon be transferred from the style image in a more complete form than in the previous experiments.
(I hope that posting the images in this way, where you can click on them to get the full size, is better than creating really long comments filled with images.)
Some experiments with different amount of channels for style and content layers, and experiments with equalizing either the content or style layer channels:
So all the "equalized" results that I have created seem to be have a flaw in the code that allowed for the loss values to not be NANs. I don't know what it is, but the equalization does not seem to work for me.
I'll have to play around with the parameters and see if that's the cause.
Edit:
I think one or both of these parameters are the cause:
-backend cudnn -cudnn_autotune
Removing both of them results in:
Capturing style target 1
relu1_1
relu2_1
relu3_1
relu4_1
relu5_1
Running optimization with L-BFGS
<optim.lbfgs> creating recyclable direction/step/history buffers
Iteration 50 / 1500
Content 1 loss: 4518842.968750
Style 1 loss: 1396907.775879
Style 2 loss: 734496046.875000
Style 3 loss: 916056093.750000
Style 4 loss: nan
Style 5 loss: 1462419.982910
Total loss: nan
Iteration 100 / 1500
Content 1 loss: 4518842.968750
Style 1 loss: 1396907.775879
Style 2 loss: 734496046.875000
Style 3 loss: 916056093.750000
Style 4 loss: nan
Style 5 loss: 1462419.982910
Total loss: nan
These repeating loss values are like the other issue I had earlier, but when I use -backend cudnn -cudnn_autotune, I only get NaNs for everything instead of just a few.
I am not surprised that equalizing the channels produces NaNs, because channels that do not respond to the image at all, or respond only very weakly, are also pushed up to the same level as the strongest ones. I never thought that equalization made much sense, but tried it anyway.
What might work better is first suppressing the sufficiently weak channels and then equalizing the rest.
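One hedged sketch of that combination, assuming input is the CxHxW layer output for the style image (the 10% threshold and the 0.1 fallback weight are only examples):
-- per-channel norms for the style image activations
local C = input:size(1)
local norms = torch.Tensor(C)
for c = 1, C do norms[c] = input[c]:norm() end
local avg = norms:mean()
local m = torch.Tensor(C, input:size(2), input:size(3))
for c = 1, C do
  if norms[c] >= 0.1 * avg then
    m[c]:fill(avg / norms[c])   -- equalize the sufficiently strong channels
  else
    m[c]:fill(0.1)              -- suppress channels that barely respond, avoiding huge multipliers
  end
end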
PS. I noticed that you actually wrote "to not be NANs", which I do not understand, but anyway, pushing up even the channels that see nothing of interest is not very good for optimization. Maybe there is indeed a bug that allows it to work at all.
A while back I was using equal content and style weights in order to see what artifacts a particular content or style layer would produce. I found that for the VGG-16 SOD Finetune model, two style layers in particular, relu2_1 and relu3_1, were responsible for almost all of the "artifacts". The relu3_2 content layer produced the least amount of artifacts compared to the other content layers.
Some examples:
https://i.imgur.com/wQlvFml.jpg
https://i.imgur.com/YbQrwXj.png
For the NIN model, I found that using -content_layers relu1,relu2,relu7 -style_layers relu1,relu2,relu3,relu5,relu7 produced the least amount of artifacts. This worked really well in my tiling experiments, in addition to a -tv_weight of 0.000001 (just high enough to destroy the remaining artifacts, but not high enough to affect the output much).
For your goal of having less detail and a "simpler" look to your outputs, it might be useful to try eliminating the "high noise" layers from the -content_layers and -style_layers input values. That is, if my idea of using equal content and style weight values works for this sort of task. If it does, then I wonder what the effects of using "low noise" layers together with channel manipulation would be?
When using convis on a model's higher-level layers, a large number of individual images are produced. This makes trying to view the image as the model sees it very impractical.
So I was wondering about the practicality of a heat map that combines all of the images into a single false-color image?
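A rough sketch of what such a false-color combination could look like, assuming the requires from convis and that fmaps is the CxHxW output of a layer (a crude blue-to-red mapping; any proper colormap could be substituted):
-- combine all channels into one map and normalize it to 0..1
local combined = torch.sum(fmaps, 1):squeeze()
local lo, hi = combined:min(), combined:max()
combined:add(-lo):div(hi - lo + 1e-8)

-- map intensity to a simple blue (weak) -> red (strong) false-color image
local H, W = combined:size(1), combined:size(2)
local inv = combined:clone():mul(-1):add(1)    -- 1 - combined
local heat = torch.Tensor(3, H, W)
heat[1] = combined                             -- red grows with activation strength
heat[2] = torch.cmul(combined, inv):mul(4)     -- green peaks at mid-range activations
heat[3] = inv                                  -- blue where activation is weak
heat:clamp(0, 1)
image.save('heatmap.png', heat)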