fast-artistic-videos

Video style transfer using feed-forward networks.

This is the source code for fast video style transfer described in

Artistic style transfer for videos and spherical images
Manuel Ruder, Alexey Dosovitskiy, Thomas Brox

The paper builds on A Neural Algorithm of Artistic Style by Gatys et al. and Perceptual Losses for Real-Time Style Transfer and Super-Resolution by Johnson et al.; our code is based on Johnson's implementation, fast-neural-style.

It is the successor to our previous work, Artistic style transfer for videos, and runs several orders of magnitude faster.

Example videos:

Comparison between the optimization-based and feed-forward approach:


360° video:

If you find this code useful for your research, please cite

@Article{RDB18,
  author       = "M. Ruder and A. Dosovitskiy and T. Brox",
  title        = "Artistic style transfer for videos and spherical images",
  journal      = "International Journal of Computer Vision",
  month        = " ",
  year         = "2018",
  note         = "online first",
  url          = "http://lmb.informatik.uni-freiburg.de/Publications/2018/RDB18"
}


Setup

Disclaimer: Please note that this repository is no longer actively developed. Furthermore, the framework it uses, Torch, is no longer maintained and is probably incompatible with most recent software environments. I have collected possible workarounds at the bottom of this section; however, there is no guarantee that they will work, nor will there be any support from my side to get this code to run on recent environments.

First install Torch, then update / install the following packages:

luarocks install torch
luarocks install nn
luarocks install image
luarocks install lua-cjson
luarocks install hdf5

(Optional) GPU Acceleration

If you have an NVIDIA GPU, you can accelerate all operations with CUDA.

First install CUDA, then update / install the following packages:

luarocks install cutorch
luarocks install cunn

Also install stnbhwd (GPU-accelerated warping), which is included in this repository:

cd stnbhwd
luarocks make stnbhwd-scm-1.rockspec

For CUDA version 9.0 and later, you must adapt the arch flag in CMakeLists.txt at line 55 to your GPU and CUDA version.
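
A hedged way to find the right value before editing that flag (the deviceQuery sample ships with the CUDA toolkit on typical Linux installs; the compute capability 6.1 below is only an illustration for a Pascal GPU):

/usr/local/cuda/extras/demo_suite/deviceQuery | grep "CUDA Capability"
# For compute capability 6.1, the adapted arch flag would then look
# something like: -gencode arch=compute_61,code=sm_61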

If you cannot get stnbhwd to run but still want GPU acceleration at least for the stylization, remove all instances of require 'stn' from the code, then edit the warp_image function in utilities.lua and remove everything in that function except line 147.

(Optional) cuDNN

When using CUDA, you can use cuDNN to accelerate convolutions and reduce memory footprint.

First download cuDNN and copy the libraries to /usr/local/cuda/lib64/. Then install the Torch bindings for cuDNN:

luarocks install cudnn
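
For the copy step mentioned above, a possible sequence of commands (assuming the cuDNN archive was extracted to ./cuda, as with NVIDIA's tar packages):

sudo cp cuda/include/cudnn*.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*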

Workarounds for installing with recent Ubuntu / CUDA / cuDNN version

Some users were able to fix erroneous results by downgrading Torch, others by downgrading CUDA.

What worked for me, and also fixes some incompatibilities (Ubuntu 18.04, CUDA 10, cuDNN 7), was this fix.

Also, for CUDA 10 you need this fix.

And for Ubuntu 18.04, in order to install and use hdf5, you need this fix.

Optical flow estimator

Our algorithm needs a utility that estimates the optical flow between two images. Since our new stylization algorithm only needs a fraction of the time compared to the optimization-based approach, the optical flow estimator can become the bottleneck. Hence, the choice of a fast optical flow estimator is crucial for near real-time execution.

There are example scripts in our repository for either DeepFlow or FlowNet 2.0. DeepFlow is slower, but it comes as a standalone executable and is therefore very easy to install. Faster execution times can be reached with FlowNet 2.0, which also runs on the GPU, provided you have a sufficiently fast GPU. FlowNet 2.0 was used for the experiments in our paper.

DeepFlow setup instructions

Just download both DeepFlow and DeepMatching (CPU version) and place the static binaries (deepmatching-static and deepflow2-static) in the root directory of this repository.
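
The flow between two frames can then be computed by chaining the two binaries, roughly as follows (a sketch only; the repository's helper scripts wrap a similar call, and the frame and output file names are placeholders):

./deepmatching-static frame_0001.ppm frame_0002.ppm | ./deepflow2-static frame_0001.ppm frame_0002.ppm forward_0001.flo -match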

FlowNet 2.0 setup instructions

Go to flownet2 (GitHub) and follow the instructions there on how to download, compile and use the source code and pretrained models. Since FlowNet is built upon Caffe, you may also want to read Caffe | Installation for a list of dependencies. There is also a Dockerfile for easy installation of the complete code in one step: flownet2-docker (GitHub).

Then edit run-flownet-multiple.sh and set the paths to the FlowNet executable, model definition and pretrained weights.
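
The three paths might, for instance, look like the following (illustrative only; the exact file names depend on your flownet2 checkout and the FlowNet variant you downloaded):

/path/to/flownet2/build/tools/caffe.bin                               # FlowNet executable
/path/to/flownet2/models/FlowNet2/FlowNet2_deploy.prototxt.template   # model definition
/path/to/flownet2/models/FlowNet2/FlowNet2_weights.caffemodel.h5      # pretrained weights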

If you have trouble installing Caffe, there is also a TensorFlow implementation: FlowNet2 (TensorFlow). However, you will have to adapt the scripts in this repository accordingly.

Please don't ask me for support installing FlowNet 2.0. Ask the original authors or use DeepFlow.

Pretrained Models

Download all pretrained video style transfer models by running the script

bash models/download_models.sh

This will download 6 video model files and 6 image model files (~300 MB) to the folder models/.

You can download pretrained spherical video models with download_models_vr.sh; it will download 2 models (~340 MB). These models are larger because they have more filters. We later found that fewer filters can achieve similar performance, but we did not retrain the spherical video models.
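
Assuming the script resides in the models/ folder like the one above, the call would be:

bash models/download_models_vr.sh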

Running on new videos

Example script

You can use the scripts stylizeVideo_*.sh <path_to_video> <path_to_video_model> [<path_to_image_model>] to easily stylize videos using pretrained models. Choose one of the optical flow methods and specify one of the models we provide (see above). If no image model is specified, the script will use the video model to generate the first frame (by marking everything as occluded). The script performs all preprocessing steps for you. For longer videos, make sure you have enough disk space available, since the video will be extracted into uncompressed image files.
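
For example, using FlowNet 2.0 as the optical flow method (the model file names are placeholders for the files downloaded to models/ above):

bash stylizeVideo_flownet.sh example.mp4 models/<your_style>_video.t7 models/<your_style>_image.t7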

Advanced usage

For advanced users, videos can be stylized directly with fast_artistic_videos.lua.

You must specify the following options:

By default, this script runs on CPU; to run on GPU, add the flag -gpu specifying the GPU on which to run.

Other useful options:

To use this script for evaluation, specify -evaluate and give the following options:

Running on new spherical videos

To stylize spherical videos, frames must be present as cube map projections with overlapping borders. Most commonly, however, spherical videos are encoded as an equirectangular projection. Therefore, a reprojection becomes necessary.

Reprojection software

Transform360 can perform the necessary transformations. To install, follow the instructions in their repository.

Example script

Given a successful Transform360 compilation and a VR video in equirectangular projection (the most common format), you can use the script stylizeVRVideo_[deepflow|flownet].sh <path_to_equirectangular_projection_video> <path_to_pretrained_vr_model>. Make sure to place the ffmpeg binary produced by the Transform360 build in the root directory of this repository. As above, also make sure you have enough disk space available for longer videos.
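
For example (the model file name is a placeholder for one of the spherical models downloaded above):

bash stylizeVRVideo_flownet.sh my_360_video.mp4 models/<your_vr_model>.t7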

Advanced usage

See the example scripts above for a preprocessing pipeline. Each cube face must be stored in a separate file.

fast_artistic_videos_vr.lua has similar options to the video script, with the following differences:

Training new models

Training a new model is complicated and requires a lot of preparation steps. Only recommended for advanced users.

Prerequisites

Note that you can omit some of these steps depending on the training parameters (see below). However, if you aim to reproduce the results in our paper, all steps are necessary.

First, you need to prepare a video dataset consisting of videos from the Hollywood2 dataset. This requires a lot of free hard drive capacity (>200 GB).

Secondly, to make use of the mixed training strategy, the spherical video training, or the additional training data from simulated camera movement on single images, you also need to prepare a single-image dataset as described by Johnson et al. You may want to change the image size to 384x384, since the algorithm takes multiple smaller crops per image and resizes them to 256x256.
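
For instance, the resizing could be done with ImageMagick (an assumption; any image processing tool will do):

mogrify -resize "384x384^" -gravity center -extent 384x384 /path/to/single_image_dataset/*.jpg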

Thirdly, you have to download the loss network from here and place it somewhere.

Fourthly, create a single-image style transfer model as described by Johnson et al. (you can also use a pre-trained model if it has the same style). Remember the settings for style and content weight, style image size, and other parameters that change the appearance of the stylized image, and then use the same parameters for the video net. Different parameters may cause unwanted results.

Training parameters

Now you can start training with train_video.lua using the following main arguments:

Besides that, the following optional arguments can be modified to customize the result:

Training data options:

Model options:

Optimization options:

Checkpointing:

Backend:

Training parameters for the results in our paper

Simple training (baseline):

th train_video.lua -data_mix video:3,shift:1,zoom_out:1 -num_frame_steps 0:1 -num_iterations 60000 -pixel_loss_weight 50 -arch c9s1-32,d64,d128,R128,R128,R128,R128,R128,U2,c3s1-64,U2,c9s1-3

Mixed training:

th train_video.lua -data_mix video:3,shift:1,zoom_out:1,single_image:5 -num_frame_steps 0:1 -num_iterations 60000 -pixel_loss_weight 100 -arch c9s1-32,d64,d128,R128,R128,R128,R128,R128,U2,c3s1-64,U2,c9s1-3

Multi-frame, mixed training:

th train_video.lua -data_mix video:3,shift:1,zoom_out:1,single_image:5 -num_frame_steps 0:1,50000:2,60000:4 -num_iterations 90000 -pixel_loss_weight 100 -arch c9s1-32,d64,d128,R128,R128,R128,R128,R128,U2,c3s1-64,U2,c9s1-3

Spherical videos:

First, train a video model of any kind.

Then, finetune on spherical images:

th train_video.lua -resume_from_checkpoint <checkpoint_path> -data_mix ...,vr:<n> -num_iterations <iter>+30000 -checkpoint_name ..._vr

where you have to replace <n> such that vr is presented exactly half of the time (e.g. 5 for simple training, 10 for multi-frame training), and <iter>+30000 with the number of iterations of the previous model plus 30000 (i.e. we finetune for 30000 iterations); otherwise, use the same parameters as for the video model. However, to avoid overwriting the video model, change the checkpoint_name parameter, as in the example below.
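
For illustration, a possible finetuning call for the simple (baseline) model above, where vr:5 makes spherical data exactly half of the mix (video:3 + shift:1 + zoom_out:1 = 5) and 60000 + 30000 = 90000 iterations; the checkpoint path and name are placeholders:

th train_video.lua -resume_from_checkpoint checkpoints/<baseline_checkpoint>.t7 -data_mix video:3,shift:1,zoom_out:1,vr:5 -num_iterations 90000 -pixel_loss_weight 50 -arch c9s1-32,d64,d128,R128,R128,R128,R128,R128,U2,c3s1-64,U2,c9s1-3 -checkpoint_name <baseline_checkpoint>_vr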

Contact

For issues or questions related to this implementation, please use the issue tracker. For everything else, including licensing issues, please email us. Our contact details can be found in our paper.

License

Free for personal or research use; for commercial use please contact us. Since our algorithm is based on Johnson's implementation, see also the License section of fast-neural-style.