jcjohnson / neural-style

Torch implementation of neural style algorithm
MIT License

Where should I start if I want to train a model for usage with Neural-Style? #292

Open ProGamerGov opened 8 years ago

ProGamerGov commented 8 years ago

Where should I start if I want to train a model for usage with Neural-Style?

Are Network In Network (NIN) models easier to train than VGG models?

Does anyone know of any guides that cover training a model that is compatible with Neural-Style from start to finish? If not, then what do I need to look for in order to make sure the model I am learning to train is compatible with Neural-Style?

What is the easiest way to train a model for use with neural-style? Are there any AMIs available that will let me start messing around with training right away?

ProGamerGov commented 7 years ago

So this is a far superior batch image downloader: https://chrome.google.com/webstore/detail/bulk-image-downloader/lamfengpphafgjdgacmmnpakdphmjlji?hl=en

The "fatkun batch download" can't handle more than 300-400 images without freezing. The one I linked in this comment can do over 3000+ images, but be warned that it does not save the images into a folder, and just dumps them into the downloads folder.

htoyryla commented 7 years ago

Thanks for the iter_size tip.

I have edited the posting below, after noticing that the prototxt I am using in the current training is not for VGG16 but a smaller 5-layer convnet.

I haven't tried training with Caffe for a long time, but decided now to try again. I am using my 2500+ photos of places, which I have classified using the Places205 model and the scripts linked in a post above, renumbering the 168 labels found to the range 0...167. I am training a small convnet from scratch, using my own training prototxt, the SGD optimizer and base_lr: 0.000005. The losses and accuracies are improving, but very slowly; the accuracy after the first 24 hours is 0.1312.
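
The renumbering step might look roughly like this in Python (a sketch with hypothetical file names; the actual scripts linked above may differ):

# Renumber whatever label ids appear in a Caffe "path label" list file
# to the contiguous range 0..N-1. File names are hypothetical.
lines = [l.rsplit(None, 1) for l in open('places_list.txt') if l.strip()]
old_labels = sorted({int(lbl) for _, lbl in lines})
remap = {old: new for new, old in enumerate(old_labels)}

with open('train.txt', 'w') as out:
    for path, lbl in lines:
        out.write('%s %d\n' % (path, remap[int(lbl)]))

print('%d labels renumbered to 0..%d' % (len(remap), len(remap) - 1))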

Training a deep net like VGG16 or VGG19 is difficult. I have read that the original VGG16 and VGG19 were not trained in one go, but gradually training less deep versions first.

Often when my attempts fail totally to converge, there is something wrong with the labels. Like this time: in the first attempt, the num_output was wrong (I had forgotten to modify it to match the dataset) and the losses did not start to diminish; quite the opposite. Also, it should be obvious that the training data, the images and the labels, should be such that learning is possible, that there are consistent, recognizable features. Thinking about a deep model with randomly set weights, it is a wonder that the learning can get started in the right direction at all.
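
A quick sanity check for the num_output mismatch mentioned above, as a minimal sketch (the list file name is hypothetical):

# Count the classes present in a Caffe "path label" list file.
# The final classifier layer's num_output must be at least max_label + 1.
labels = set()
with open('train.txt') as f:
    for line in f:
        if line.strip():
            labels.add(int(line.rsplit(None, 1)[1]))

print('distinct labels:', len(labels))
print('max label:', max(labels))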

ProGamerGov commented 7 years ago

Here is the result from 4600 iterations on a data set composed of approximately 3200 Deepart.io output images, and 3200 Ostagram output images. The Neural-Style output is 1500, and the model was a fine-tuned VGG16 SOD Finetune Model.

ProGamerGov commented 7 years ago

So I tried adding the following code to every layer except the required ones:

param {
  lr_mult: 0    # learning rate of weights
  decay_mult: 1
}
param {
  lr_mult: 0    # learning rate of bias
  decay_mult: 0
}

As per the information I found here: https://github.com/BVLC/caffe/wiki/Fine-Tuning-or-Training-Certain-Layers-Exclusively

But I would always receive an error like: Expected ":" instead of "{". I am unsure of how to resolve this issue so that I can experiment with training only a single layer at a time.

htoyryla commented 7 years ago

On 3.10.2016 23:49, ProGamerGov wrote:

So I tried adding the following code to every layer except the required ones:

param {
  lr_mult: 0    # learning rate of weights
  decay_mult: 1
}
param {
  lr_mult: 0    # learning rate of bias
  decay_mult: 0
}

As per the information I found here: https://github.com/BVLC/caffe/wiki/Fine-Tuning-or-Training-Certain-Layers-Exclusively

But I would always receive an error like: Expected ":" instead of "{". I am unsure of how to resolve this issue so that I can experiment with training only a single layer at a time.

I have not seen it explained clearly anywhere, but there appear to exist two syntax variants of prototxt. Both work but cannot be mixed. One uses "layers" and the other "layer".

I am on thin ice now, but it could be that the equivalent of your definition would be in the other syntax:

blobs_lr: 0
blobs_lr: 0
weight_decay: 1
weight_decay: 0

That is, these lines only, without the param block. If you are modifying an existing prototxt, you should be able to see which alternative is being used. Stick to the same syntax.

htoyryla commented 7 years ago

I made a typo. Meant to say "without the param block". Only the four lines, no param block.

Cannot login right now to modify the comment.

ProGamerGov commented 7 years ago

@htoyryla

From previous testing, I found this interesting: https://i.imgur.com/XHg8CPA.jpg. It appears that after 200 iterations, the edge detection starts being damaged by the fine-tuning. In all my training experiments I have seen very similar results, though usually the iterations where edge detection begins to break down are near the halfway point, as opposed to so close to the start.

ProGamerGov commented 7 years ago

@htoyryla The difference between "layers" and "layer" is that "layers" is the outdated prototxt syntax. You can use upgrade_net_proto_text to update the prototxt file to the newer version.

cd ~ 

cd caffe

./build/tools/upgrade_net_proto_text vgg16_finetuned_train_val.prototxt vgg16_finetuned_train_val_out.prototxt

./build/tools/upgrade_net_proto_binary VGG16_SOD_finetune.caffemodel VGG16_SOD_finetune_out.caffemodel
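
One way to confirm which syntax a prototxt uses, and that the upgrade actually rewrote it, is to parse it with the Caffe protobuf definitions. A minimal sketch, assuming pycaffe is installed:

from caffe.proto import caffe_pb2
from google.protobuf import text_format

net = caffe_pb2.NetParameter()
text_format.Merge(open('vgg16_finetuned_train_val_out.prototxt').read(), net)

# The upgraded file should populate "layer" and leave the deprecated "layers" empty.
print('new-style "layer" definitions:  ', len(net.layer))
print('old-style "layers" definitions: ', len(net.layers))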

ProGamerGov commented 7 years ago

I seem to have figured out how to change the output in a neutral manner that only affects the seed value in Neural-Style by fine-tuning the VGG-16 SOD Finetune model. Interestingly enough, my data set was composed of art produced by neural networks.

Edit:

On closer inspection, it appears that the differences between the original and the fine-tuned version are in the smaller details. I only ran it for 600 iterations, as I have to use AWS spot instances for this kind of thing, but it looks like the newly fine-tuned model produces more intricate details than the original model.

If I have achieved settings that result in an almost neutral change, then I can now theoretically change single parameters, target layers, etc... to achieve better artistic outputs.
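
One way to check how neutral the fine-tuning really was is to compare the two sets of weights layer by layer with pycaffe. A rough sketch, assuming a deploy-style prototxt that fits both models; the file names are placeholders:

import numpy as np
import caffe

caffe.set_mode_cpu()
orig  = caffe.Net('VGG16_SOD_deploy.prototxt', 'VGG16_SOD_finetune.caffemodel', caffe.TEST)
tuned = caffe.Net('VGG16_SOD_deploy.prototxt', 'finetuned_600iter.caffemodel', caffe.TEST)

# Relative change of the weight tensor in each learnable layer.
for name in orig.params:
    w0 = orig.params[name][0].data
    w1 = tuned.params[name][0].data
    rel = np.linalg.norm(w1 - w0) / (np.linalg.norm(w0) + 1e-12)
    print('%-10s relative weight change: %.5f' % (name, rel))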

ProGamerGov commented 7 years ago

So targeting specific layers seems to produce different outputs that are no worse than the original model's outputs. I really wish I had the resources to fully flesh this out, as it looks really promising for enhancing Neural-Style's outputs.

I think that by targeting different combinations of the default layers that Neural-Style uses, one can improve the model's ability in specific areas with the proper data set.

ProGamerGov commented 7 years ago

This prototxt here has been configured to stop learning on all layers by default: https://gist.github.com/ProGamerGov/1514d74dc6b799389875ce1764c1a12e

I was using the VGG16_SOD_finetune model: https://gist.github.com/jimmie33/509111f8a00a9ece2c3d5dde6a750129

And I ran ./build/tools/upgrade_net_proto_binary VGG16_SOD_finetune.caffemodel VGG16_SOD_finetune_out.caffemodel to convert the model to the latest version of Caffe.

You can allow learning on your layer of choice by changing the following lines of code on the desired layer:

  param {
    lr_mult: 0
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }

To:

  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }

The learning related values are from this Caffe guide here for training certain layers exclusively: https://github.com/BVLC/caffe/wiki/Fine-Tuning-or-Training-Certain-Layers-Exclusively

Another note: the model's edge detection abilities do not seem to be positively or negatively impacted by this layer-specific training.


I can also provide my two-category Deepart.io and Ostagram data set, which contains approximately 3000 images for each of the two categories, if you want.
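
Rather than editing each layer by hand, the lr_mult flipping described above could also be scripted with pycaffe's protobuf bindings. A sketch, assuming the prototxt already declares explicit param blocks for every learnable layer (as the gist above does); the file and layer names are placeholders:

from caffe.proto import caffe_pb2
from google.protobuf import text_format

TRAIN_LAYER = 'conv4_2'   # the single layer that should keep learning

net = caffe_pb2.NetParameter()
text_format.Merge(open('train_val.prototxt').read(), net)

for layer in net.layer:
    if not layer.param:            # no learnable parameters declared
        continue
    train = (layer.name == TRAIN_LAYER)
    for i, p in enumerate(layer.param):
        # first param block = weights, second = bias
        p.lr_mult = (1 if i == 0 else 2) if train else 0

with open('train_val_frozen.prototxt', 'w') as f:
    f.write(text_format.MessageToString(net))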

ProGamerGov commented 7 years ago

crowsonkb's style_transfer has an updated Amazon AMI, which has the latest version of Caffe already installed.

ProGamerGov commented 7 years ago

It looks like training a specific layer, or the default Neural-Style layers, requires much longer training before major differences between the original and fine-tuned model become noticeable.

Here are the results from some small-scale experiments I ran using the newly found neutral training parameters on the upgraded model and prototxt files: https://i.imgur.com/k0jxvtv.png

ProGamerGov commented 7 years ago

So, just in case I am making the wrong assumptions: as per the prototxt file and Neural-Style's default layer settings, the -content_layers and -style_layers values map to the following prototxt layer names.

Prototxt Neural-Style
conv1_1 relu1_1
conv2_1 relu2_1
conv3_1 relu3_1
conv4_1 relu4_1
conv4_2 relu4_2
conv5_1 relu5_1

Or is Neural-Style using the part below each "conv" layer which has "relu" instead of "conv"?

Example of the prototxt layout:

layer {
  name: "conv1_1"
  type: "Convolution"
  bottom: "data"
  top: "conv1_1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    pad: 1
    kernel_size: 3
    weight_filler {
      type: "gaussian"
      mean: 0
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu1_1"
  type: "ReLU"
  bottom: "conv1_1"
  top: "conv1_1"
}

The prototxt I was using can be found here: https://gist.github.com/ProGamerGov/1514d74dc6b799389875ce1764c1a12e
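
To list exactly which layer names the prototxt defines, and which blob each one reads and writes, the file can be parsed with the Caffe protobuf bindings. A minimal sketch, assuming pycaffe is installed; in this prototxt each relu layer has the matching conv blob as both bottom and top, i.e. the ReLU runs in place on the conv output:

from caffe.proto import caffe_pb2
from google.protobuf import text_format

net = caffe_pb2.NetParameter()
text_format.Merge(open('vgg16_finetuned_train_val.prototxt').read(), net)

for layer in net.layer:
    print('%-10s %-14s bottom=%-12s top=%s'
          % (layer.name, layer.type, ','.join(layer.bottom), ','.join(layer.top)))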

htoyryla commented 7 years ago

I am not fully sure I understand your question, especially when you say 'is Neural-Style using the part below each "conv" layer which has "relu" instead of "conv"'. Do you mean "below" in the sense of "below in the prototxt file", or "in a lower layer"?

But never mind. ReLU is really nothing more than an add-on function on top of a conv layer which sets all negative values to zero. This is why it is also called a rectifier. So in theory convx_y can output both negative and positive values, but after relux_y all negative values have been replaced by zero.

Furthermore, this discussion https://github.com/jcjohnson/neural-style/issues/93 hints that in an implementation such as Torch, the ReLU layer is actually performed in-place, which I read to mean that the ReLU directly modifies the memory containing the output of the conv layer. If this is true, then there is actually no difference whether one uses conv or relu layers in neural_style; the ReLU function is there anyway, even if you access the conv layer.

jcjohnson commented 7 years ago

@htoyryla You are correct that ReLU is performed in-place in Torch so after a forward pass it doesn't matter whether you pick a conv layer or its associated ReLU layer; they will both have the same value. However there will be a difference during the backward pass: when you backprop through a ReLU layer, the upstream gradients will be zeroed in the same places the activations were zeroed during the forward pass; if you ask neural-style to work with a conv layer then it will not backprop through the ReLU during the backward pass.

This means that when you ask neural-style to use activations on a conv layer, ReLU gets used during the forward pass but not during the backward pass, so the backward pass will not be correct in this case. You can still get nice style transfer effects even when the gradients are incorrect in this way, but for this reason I'd generally expect better results using ReLU layers.
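
A tiny numpy illustration of this point, with made-up numbers: the forward values are identical whether you read the conv blob or its ReLU (since the ReLU is in-place), but the gradient differs depending on whether the backward pass goes through the ReLU:

import numpy as np

x = np.array([-2.0, -0.5, 1.0, 3.0])       # pretend conv layer output
upstream = np.array([1.0, 1.0, 1.0, 1.0])  # gradient arriving from the loss

relu_out = np.maximum(x, 0.0)              # forward: in-place ReLU gives the same stored values

grad_through_relu = upstream * (x > 0)     # backprop through the ReLU: zeroed where x <= 0
grad_at_conv_only = upstream               # stopping at the conv layer: nothing is zeroed

print(relu_out)           # [0, 0, 1, 3]
print(grad_through_relu)  # [0, 0, 1, 1]
print(grad_at_conv_only)  # [1, 1, 1, 1]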

htoyryla commented 7 years ago

@jcjohnson, good point, I did not think about the backward pass.

ProGamerGov commented 7 years ago

I suspect that image quality affects training accuracy. This research paper seems to show the effects of image quality on training neural network models: "Understanding How Image Quality Affects Deep Neural Networks"

ProGamerGov commented 7 years ago

I recently trained a NIN model on a roughly sorted custom data set of about 40,000 faces. There appear to be direct improvements in how the model handles faces in content images, but style images that do not contain faces do not work as well. I think that if one could train the model on artwork, in addition to common content images, it would help the model understand both.

htoyryla commented 7 years ago

I have sometimes been thinking about using two models, one for style, one for content, both trained with limited material. Don't know if it would work though, and memory usage would certainly be a problem. Yet it could be an interesting exercise.

ProGamerGov commented 7 years ago

@htoyryla That idea could be made more resource efficient by using two small NIN-like models, each trained on only one target category.


So it turns out that at least for the NIN model, it still has the knowledge required for style transfer, in addition to the newer face related knowledge that I gave it.

The unmodified NIN model is on the right, and the fine tuned NIN model is on the left:

I used a DeepDream project based on Neural-Style to try and determine why things had changed in the modified NIN model. Below are the DeepDream layer activation tests for all 29 layers used by the NIN model:

The original model:

The modified model:

These DeepDream images helped me figure out that by simply changing the -content_layers and -style_layers, I could utilize the improved facial feature detection abilities of my fine tuned NIN model.

The NIN model itself that I created had 15700 iterations during training, and seemed to maintain 86-96% accuracy during the last couple thousand iterations. With around 40k training images, I calculated that around 24-25 epochs occurred during the training session. I also stopped the training at 11600 iterations in order to lower the learning rate so that the loss would continue going down. I'm not sure if I was over-fitting the model, but it seemed to have improved abilities on an image that was not a part of the training data set.
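
For reference, the epoch estimate works out roughly like this (assuming the batch size of 64 mentioned in a later comment):

iterations = 15700
batch_size = 64          # assumption: the batch size reported later in this thread
num_images = 40000       # approximate size of the faces data set

images_seen = iterations * batch_size
epochs = images_seen / float(num_images)
print('roughly %.1f epochs' % epochs)    # about 25 epochs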

After the NIN experiments, I attempted to fine-tune a VGG-16 model on my rough faces data set. It's a lot slower to fine-tune VGG-16 models than it is to fine-tune NIN models. From iterations 1000 to 8000, it seems that the model is actually improving in its ability to recognize facial features:

The output from the non fine tuned SOD_FINETUNE model can be found here: https://i.imgur.com/wWtWysT.png

Obviously for my experiments I used the exact same parameters, seed values, etc... to eliminate any other things that might cause different outputs.

An album with the full versions of the images I posted in this comment can be found here: https://imgur.com/a/njDJ1

Edit:

To clarify, the VGG-16 model that I fine tuned is called the "VGG-16 SOD Finetune" model. The "finetune" in the original model's name is because it was fine tuned for salient object detection from the regular VGG-16 model. I have now fine tuned this previously fine tuned model, with a new data set.

ProGamerGov commented 7 years ago

Trying to train a NIN model from scratch with my data set did not work; it only produces blurry style transfer images and broken DeepDream images. Maybe there are certain classes that help the model learn other classes? Or maybe I just chose bad training parameters?

Edit:

Analyzing the training loss (I'm not sure what graphing tool to use), it appears that for the NIN model trained from scratch, the loss decreased quickly and then stayed constant. For the fine-tuned NIN model, the training loss dropped quickly and then seems to have decreased very slowly or stayed about the same. Though it must have worked better than when I tried to train from scratch, seeing as it does appear to have better facial feature detection abilities.
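
One option for graphing is to pull the loss values out of the Caffe training log with a small script (Caffe also ships log-parsing helpers under tools/extra/). A sketch, assuming the solver output was saved to a log file; the file names are placeholders:

import re
import matplotlib.pyplot as plt

iters, losses = [], []
with open('caffe_training.log') as f:
    for line in f:
        # lines look like: "... solver.cpp:228] Iteration 100, loss = 0.693"
        m = re.search(r'Iteration (\d+).*?loss = ([0-9.eE+-]+)', line)
        if m:
            iters.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(iters, losses)
plt.xlabel('iteration')
plt.ylabel('training loss')
plt.savefig('loss_curve.png')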

The fine-tuned SOD model has the loss drop continuously over time, which I imagine is what one should expect with good training parameters. So I think the results from my fine-tuned NIN model are questionable and need better training parameters, but the VGG-16 SOD model seems to actually be improved in a way that is appropriately reflected in the loss values.

Second Edit:

After some more testing on my fine tuned SOD model, it appears that I may have actually improved the model with very little change to the model's other abilities. It now more accurately deals with faces, and possibly other parts of the human body (upper portion of the body I think).

I wonder if the "roughly sorted" nature of my data set helps the model's new abilities, or weakens the model's new abilities?

ProGamerGov commented 7 years ago

The training loss graphs seem to support my results.

The NIN model from scratch is on the left, and the fine tuned NIN model is on the right:

The fine tuned VGG-16 model:

I think using a larger batch size (64 instead of less than 10) compared to earlier experiments is part of the reason for this recent training success.

ProGamerGov commented 7 years ago

I think I might be onto something here as my fine tuned model appears to be better at facial feature preservation:

An album with the full images can be found here: https://imgur.com/a/tArrY

It looks as though my fine tuned model is more accurately detecting the eyes, and mouth of the person in the photo.

The solver.prototxt file and the train_val.prototxt can be found here: https://gist.github.com/ProGamerGov/2bdf7659ee14dac03269a3ec3a7f1fcd

ProGamerGov commented 7 years ago

ImageMagick seems to be slow for resizing large data sets of images (especially when using the -resize option), but using parallel like this makes it faster:

parallel -j 8 convert {} -resize '256x256^' -gravity Center {} ::: *.png
parallel -j 8 mogrify -format png {} ::: *.jpg

You can get Parallel via sudo apt-get install parallel.

Source: https://stackoverflow.com/questions/26168783/imagemagick-convert-and-gnu-parallel-together
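
If ImageMagick is still the bottleneck, roughly the same resize (smaller side scaled to 256) can also be done in parallel from Python with Pillow. A sketch, not a drop-in replacement; the thread count and in-place overwrite mirror the commands above:

import glob
from multiprocessing import Pool
from PIL import Image

def resize_min_side(path, target=256):
    img = Image.open(path).convert('RGB')
    w, h = img.size
    scale = float(target) / min(w, h)
    img = img.resize((int(round(w * scale)), int(round(h * scale))), Image.LANCZOS)
    img.save(path)   # overwrite in place, like the convert/mogrify commands above

if __name__ == '__main__':
    Pool(8).map(resize_min_side, glob.glob('*.png'))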

ProGamerGov commented 7 years ago

I uploaded the Rough Faces model and added a link to download it on the alternative models wiki page: https://github.com/jcjohnson/neural-style/wiki/Using-Other-Neural-Models

Hopefully it can help those seeking better facial preservation with Neural-Style.

htoyryla commented 7 years ago

For me, the Rough Faces model didn't work well. For a face-specific model, I would expect it to have strong activations for features like head/face shape, hair, eyes, nose, mouth etc. Here, it mainly picked up the meringue dessert in the background :)

content image: hannu512z

style image: naama007

result: out

ProGamerGov commented 6 years ago

I did test your content image with my fine-tuned model, and I think the issue may be that the "rough faces" training data was not very diverse, so the model performs best with certain images. My example image for testing was also a part of the training data, so that may skew the results (though I did test it on other images that I think were not part of the training data).

ProGamerGov commented 6 years ago

I was looking through my old experiments, and I see that I never actually shared the two successfully fine-tuned models that I had created. One model in particular (the "Plaster" model) creates a very different output than the non fine-tuned version.

Some experimentation with parameters may be required to achieve satisfactory results because, as with all the models I trained and fine-tuned, I only tested them in Neural-Style with certain parameter values.

I'm not sure if the "Low Noise" model is actually different from the non fine-tuned model in a way that is useful for certain styles like the "Plaster" model is, so it can be removed if it's not useful.

I posted the models on the wiki page here: https://github.com/jcjohnson/neural-style/wiki/Using-Other-Neural-Models

Seeing as both models are from 2016, I am going to test them with a bunch of more "modern" Neural-Style parameters, like setting the TV weight to 0, using the Adam parameters I discovered in addition to L-BFGS, and using multiscale resolution.

Nusrat12 commented 5 years ago

I want to change the number of iterations. Where do I have to change it?

ProGamerGov commented 5 years ago

@Nusrat12 You control the maximum number of iterations by setting the max_iter: value in your solver.prototxt. You can see an example of it here.