SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing (CVPR 2022)

Yichun Shi, Xiao Yang, Yangyue Wan, Xiaohui Shen

Recent studies have shown that StyleGANs provide promising prior models for downstream tasks on image synthesis and editing. However, since the latent codes of StyleGANs are designed to control global styles, it is hard to achieve a fine-grained control over synthesized images. We present SemanticStyleGAN, where a generator is trained to model local semantic parts separately and synthesizes images in a compositional way. The structure and texture of different local parts are controlled by corresponding latent codes. Experimental results demonstrate that our model provides a strong disentanglement between different spatial areas. When combined with editing methods designed for StyleGANs, it can achieve a more fine-grained control to edit synthesized or real images. The model can also be extended to other domains via transfer learning. Thus, as a generic prior model with built-in disentanglement, it could facilitate the development of GAN-based applications and enable more potential downstream tasks.

Description

Official Implementation of our SemanticStyleGAN paper for training and inference.

Installation
Pretrained Models
Inference
Training
Credits
Acknowledgments
Citation

Installation

Python 3
PyTorch 1.8+
Run pip install -r requirements.txt to install additional dependencies.

Pretrained Models

In this repository, we provide pretrained models for various domains.

Path	Description
CelebAMask-HQ	Trained on the CelebAMask-HQ dataset.
BitMoji	Fine-tuned on the re-cropped BitMoji dataset.
MetFaces	Fine-tuned on the MetFaces dataset.
Toonify	Fine-tuned on the Toonify dataset.

Inference

Synthesis

Random Synthesis

In visualize/generate.py, we provide a script for sampling random images and their corresponding segmentation masks with SemanticStyleGAN. An example command is provided below:

python visualize/generate.py \
pretrained/CelebAMask-HQ-512x512.pt \
--outdir results/samples \
--sample 20 \
--save_latent

The --save_latent flag will save the w latent code of each synthesized image in a separate .npy file.
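
The saved codes can be read back with NumPy for later editing. The following is a minimal sketch, assuming the files follow the `*_latent.npy` naming used in the example commands of this README. The helper name `load_latents` is hypothetical and not part of the repository, and the array shape depends on the model.

```python
from pathlib import Path

import numpy as np


def load_latents(sample_dir):
    """Load every *_latent.npy file in sample_dir, keyed by file stem.

    load_latents is a hypothetical helper, not part of the repository.
    """
    latents = {}
    for path in sorted(Path(sample_dir).glob("*_latent.npy")):
        latents[path.stem] = np.load(path)
    return latents
```

For example, `load_latents("results/samples")` would return a dictionary whose keys look like `000000_latent`, provided visualize/generate.py was run with --save_latent first.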

Local Latent Interpolation

In visualize/generate_video.py, we provide a script for visualizing the local interpolation by SemanticStyleGAN. An example command is provided below:

python visualize/generate_video.py \
pretrained/CelebAMask-HQ-512x512.pt \
--outdir results/interpolation \
--latent results/samples/000000_latent.npy

Here, results/samples/000000_latent.npy is a latent code either generated by visualize/generate.py or output by visualize/invert.py. You can also omit the --latent argument to generate a video from a random latent code. The script will create several mp4 files under the output folder, each showing the interpolation animation in a specific latent subspace.
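
Conceptually, a local interpolation changes only the part of the latent code that belongs to one semantic component and leaves the rest fixed. The sketch below illustrates the idea with NumPy. It assumes a code shaped `(n_latent, dim)`, and the row indices chosen for a component are hypothetical, since the real index layout is defined by the model. It is not the implementation in visualize/generate_video.py.

```python
import numpy as np


def interpolate_local(w_src, w_dst, rows, steps):
    """Linearly interpolate only the given rows of w_src towards w_dst.

    w_src, w_dst: arrays of shape (n_latent, dim).
    rows: indices of the latent rows assumed to control one part (hypothetical).
    steps: number of frames, including both endpoints (must be >= 2).
    Returns an array of shape (steps, n_latent, dim).
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    rows = np.asarray(rows)
    # Start every frame as a copy of the source code.
    frames = np.repeat(w_src[None], steps, axis=0)
    # Blend weights from 0 (source) to 1 (target), one per frame.
    alphas = np.linspace(0.0, 1.0, steps)[:, None, None]
    frames[:, rows] = (1.0 - alphas) * w_src[rows] + alphas * w_dst[rows]
    return frames
```

Each returned frame could then be passed to the generator to render one video frame. Rows outside `rows` stay identical to the source code throughout.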

Synthesizing Components

In visualize/generate_components.py, we provide a script for visualizing the components synthesized by SemanticStyleGAN, where we gradually add more local generators into the synthesis procedure. An example command is provided below:

python visualize/generate_components.py \
pretrained/CelebAMask-HQ-512x512.pt \
--outdir results/components \
--latent results/samples/000000_latent.npy

You can also omit the --latent argument to generate components for a random latent code.

Inversion

Optimization-based

You can use visualize/invert.py for inverting real images into the latent space of SemanticStyleGAN via optimization:

python visualize/invert.py \
--ckpt pretrained/CelebAMask-HQ-512x512.pt \
--imgdir data/examples \
--outdir results/inversion \
--size 512

This script will save the reconstructed images and their corresponding w-plus latent codes in separate sub-directories under the outdir. Additionally, you can set --finetune_step to a non-zero integer (e.g. 300) for pivotal tuning inversion, which outputs a new fine-tuned generator for each image.

You can manipulate the reconstructed faces by using the saved latent codes. You can also choose to edit the face with a fine-tuned generator from PTI or domain adaptation. An example command is provided below:

python visualize/generate_video.py \
pretrained/BitMoji-512x512.pt \
--outdir results/interpolation_inversion \
--latent results/inversion/latent/1.npy

Here is an example result of changing the inverted latent code of eyes using the BitMoji generator:

Computing Metrics

Given a trained generator and a prepared inception file, we can compute the metrics with the following command:

python calc_fid.py \
--ckpt /path/to/checkpoint \
--inception /path/to/inception/file
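
For reference, FID is the Fréchet distance between two Gaussians fitted to Inception features of real and generated images: ||mu_1 - mu_2||^2 + Tr(S_1 + S_2 - 2 (S_1 S_2)^(1/2)). The sketch below evaluates that formula from two feature arrays using NumPy only. It illustrates the metric and is not the code in calc_fid.py; the function name is hypothetical.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_samples, feature_dim).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr(sqrt(cov_a @ cov_b)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs; clip small numerical noise.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Two identical feature sets give a distance of approximately zero, and shifting one set by a constant vector adds the squared length of that vector.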

Training

Data Preparation

In our work, we use re-mapped segmentation labels of CelebAMask-HQ. To reproduce this dataset, first download the original CelebAMask-HQ dataset from here and decompress it to data/CelebAMask-HQ. Then, run the following command to create the images and labels used for training:
```
python data/preprocess_celeba.py data/CelebAMask-HQ
```
The script will create four folders under data/CelebAMask-HQ that contain the images and labels for training and testing, respectively.
Similar to rosinality's implementation of StyleGAN2, we use LMDB datasets for training. An example command is provided below:
```
python prepare_mask_data.py \
data/CelebAMask-HQ/image_train \
data/CelebAMask-HQ/label_train \
--out data/lmdb_celebamaskhq_512 \
--size 512
```
You can also use your own dataset for this step. Note that the mask labels and image files are matched according to their file names. The files may be placed under sub-directories, but make sure the base names are unique.
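
Before building the LMDB from your own data, it can help to check the matching rule yourself. The sketch below pairs images and labels by base name and fails on duplicates. It assumes that "base name" means the file name without its extension, which is an assumption and not confirmed here; `match_by_basename` is a hypothetical helper, not part of prepare_mask_data.py.

```python
from pathlib import Path


def match_by_basename(image_dir, label_dir):
    """Pair image and label files that share a base name (stem).

    Searches sub-directories recursively and raises ValueError when a base
    name occurs more than once on either side. Hypothetical helper.
    """
    def index(root):
        found = {}
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file():
                continue
            if path.stem in found:
                raise ValueError(f"duplicate base name: {path.stem}")
            found[path.stem] = path
        return found

    images, labels = index(image_dir), index(label_dir)
    # Keep only the base names present on both sides.
    common = sorted(images.keys() & labels.keys())
    return [(images[name], labels[name]) for name in common]
```

Files that appear on only one side are silently skipped in this sketch, so comparing the number of returned pairs with the number of images is a quick way to spot missing labels.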

Prepare the inception file for calculating FID:

python prepare_inception.py \
data/lmdb_celebamaskhq_512 \
--output data/inception_celebamaskhq_512.pkl \
--size 512 \
--dataset_type mask

Training SemanticStyleGAN

The main training script can be found in train.py. Here, we provide an example for training on the CelebAMask-HQ dataset prepared as above:

python train.py \
--dataset data/lmdb_celebamaskhq_512 \
--inception data/inception_celebamaskhq_512.pkl \
--checkpoint_dir checkpoint/celebamaskhq_512 \
--seg_dim 13 \
--size 512 \
--transparent_dims 10 12 \
--residual_refine \
--batch 16

Alternatively, you can use the following command for multi-GPU training (we assume 8 GPUs are available):

python -m torch.distributed.launch --nproc_per_node=8 \
train.py \
--dataset data/lmdb_celebamaskhq_512 \
--inception data/inception_celebamaskhq_512.pkl \
--checkpoint_dir checkpoint/celebamaskhq_512 \
--seg_dim 13 \
--size 512 \
--transparent_dims 10 12 \
--residual_refine \
--batch 4

Here, --seg_dim refers to the number of segmentation classes (including background), and --transparent_dims specifies the classes that are treated as possibly transparent.

If you want to restore from an intermediate checkpoint, simply add the argument --ckpt /path/to/checkpoint/file, where the checkpoint file is a .pt file saved by our training script.

Additionally, if you have tensorboard installed, you can visualize tensorboard logs in the checkpoint_dir.

Domain Adaptation

In train_adaptation.py, we provide a script for performing domain adaptation on image-only datasets. To do this, you first need to create an LMDB for the target image dataset. An example command is provided below:

python prepare_image_data.py \
data/metfaces/images \
--size 512 \
--out data/lmdb_metfaces_512

Then, you can run the following command for fine-tuning on the target dataset:

python -m torch.distributed.launch --nproc_per_node=8 \
train_adaptation.py \
--ckpt pretrained/CelebAMask-HQ-512x512.pt \
--dataset data/lmdb_metfaces_512 \
--checkpoint_dir checkpoint/metfaces \
--seg_dim 13 \
--size 512 \
--transparent_dims 10 12 \
--residual_refine \
--batch 4 \
--freeze_local

The --freeze_local flag will freeze the local generators during training, which preserves the spatial disentanglement. However, for datasets that have a large geometric difference from real faces (e.g. BitMoji), you may want to remove this argument. In fact, we found that our model is still able to preserve the disentanglement within a few thousand steps of fine-tuning all modules.

Note that the dataloader for adaptation is compatible with rosinality's implementation, so you can use the same LMDB datasets for fine-tuning SemanticStyleGAN. By default, we fine-tune the model for 2000 steps, but you may want to look at the visualization samples for early stopping.

Credits

StyleGAN2 model and implementation:
https://github.com/rosinality/stylegan2-pytorch
Copyright (c) 2019 Kim Seonghyeon
License (MIT) https://github.com/rosinality/stylegan2-pytorch/blob/master/LICENSE

LPIPS model and implementation:
https://github.com/S-aiueo32/lpips-pytorch
Copyright (c) 2020, Sou Uchida
License (BSD 2-Clause) https://github.com/S-aiueo32/lpips-pytorch/blob/master/LICENSE

ReStyle model and implementation:
https://github.com/yuval-alaluf/restyle-encoder
Copyright (c) 2021 Yuval Alaluf
License (MIT) https://github.com/yuval-alaluf/restyle-encoder/blob/main/LICENSE

Please Note: The CUDA files are made available under the Nvidia Source Code License-NC

Acknowledgments

This code is initially built from SemanticGAN.

Citation

If you use this code for your research, please cite the following work:

@inproceedings{shi2021SemanticStyleGAN,
author    = {Shi, Yichun and Yang, Xiao and Wan, Yangyue and Shen, Xiaohui},
title     = {SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing},
booktitle   = {CVPR},
year      = {2022},
}
