pkhungurn / talking-head-anime-demo

Demo for the "Talking Head Anime from a Single Image."
MIT License

About two-algo-face-rotator #3

Closed deepkyu closed 4 years ago

deepkyu commented 4 years ago

Hello, @pkhungurn. First of all, it's a great honor to see your excellent project. :smile: By the way, I have a question from reading your code alongside your blog post.

In the figure above, it looks like you adapt the appearance flow module of Zhou et al. (ECCV 2016).

device = self.zhou_grid_change.weight.device
identity = torch.Tensor([[1, 0, 0], [0, 1, 0]]).to(device).unsqueeze(0).repeat(n, 1, 1)
base_grid = affine_grid(identity, [n, c, h, w])
grid = base_grid + grid_change
resampled = grid_sample(image, grid, mode='bilinear', padding_mode='border')

However, in the file tha/two_algo_face_rotator.py, it seems that you use the PyTorch function affine_grid, which builds Spatial Transformer Networks (Jaderberg et al., NIPS 2015).

Does the affine_grid function correspond to the implementation of appearance flow? If not, is there any block (or snippet) that implements appearance flow?

Again, thank you for sharing your great project! See ya.

pkhungurn commented 4 years ago

The whole snippet that you quoted, plus the one line of code above it, is an implementation of Zhou et al.'s appearance flow algorithm. I reproduce the snippet in full below:

    # Predict a per-pixel offset to the sampling grid from the feature tensor y.
    grid_change = torch.transpose(self.zhou_grid_change(y).view(n, 2, h * w), 1, 2).view(n, h, w, 2)
    device = self.zhou_grid_change.weight.device
    # The identity affine transformation, one copy per batch element.
    identity = torch.Tensor([[1, 0, 0], [0, 1, 0]]).to(device).unsqueeze(0).repeat(n, 1, 1)
    # The flow field that copies every pixel to its own location.
    base_grid = affine_grid(identity, [n, c, h, w])
    # The full appearance flow field: identity flow plus the predicted offsets.
    grid = base_grid + grid_change
    # Warp the input image according to the flow field.
    resampled = grid_sample(image, grid, mode='bilinear', padding_mode='border')

So, the affine_grid function does not by itself correspond to the implementation of the appearance flow algorithm. However, it is a part of that implementation, as it appears in one of the six lines of the snippet above.

I think you might be confused because you mentioned that "affine_grid [...] builds Spatial Transformer Networks." This is not true. What affine_grid does is create a flow field that acts like the affine transformation given as one of its arguments. Jaderberg et al. have a part of their Spatial Transformer Network (the localization network) learn to predict this argument. The flow field produced by affine_grid is then fed to grid_sample, which finally produces a resampled image for further processing. Accordingly, the PyTorch documentation for affine_grid says:

"This function is often used in conjunction with grid_sample() to build Spatial Transformer Networks ."

Note here that it is "used [...] to build Spatial Transformer Networks." In other words, it is a part of a Spatial Transformer Network; it does not by itself "build" one.
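
To make the distinction concrete, here is a minimal, standalone sketch (not code from this repository; the shapes and the shift value are arbitrary) that uses affine_grid and grid_sample with a fixed, hand-written affine matrix. The point is that both functions are plain tensor utilities and nothing in them is learned.

    # Minimal standalone sketch (not from the repository): affine_grid and
    # grid_sample are plain tensor utilities, and nothing here is learned.
    import torch
    from torch.nn.functional import affine_grid, grid_sample

    n, c, h, w = 1, 3, 64, 64
    image = torch.rand(n, c, h, w)

    # A fixed affine transformation: shift right by 0.5 in normalized coordinates.
    theta = torch.tensor([[[1.0, 0.0, 0.5],
                           [0.0, 1.0, 0.0]]])                      # shape (n, 2, 3)
    grid = affine_grid(theta, [n, c, h, w], align_corners=False)   # flow field (n, h, w, 2)
    warped = grid_sample(image, grid, mode='bilinear',
                         padding_mode='border', align_corners=False)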

Moreover, even though affine_grid is a part of Spatial Transformer Networks, nothing prevents it from being used to build other networks, including Zhou et al.'s.

This brings us to the main difference between Zhou et al.'s paper and Jaderberg et al.'s paper. Zhou et al.'s network predicts the whole flow field. Jaderberg et al.'s network, on the other hand, predicts the parameters of a transformation (an affine transformation or a thin plate spline) that is then used to create the flow field.
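
As an illustration of this difference, here is a hedged sketch with two hypothetical modules (the class names, layer sizes, and shapes are illustrative only, not from the repository or either paper): one predicts the six parameters of an affine transformation in the spirit of Jaderberg et al., and the other predicts a dense flow field directly in the spirit of Zhou et al.

    # Hedged sketch with hypothetical modules (names, layer sizes, and shapes
    # are illustrative only) contrasting the two prediction targets.
    import torch
    import torch.nn as nn
    from torch.nn.functional import affine_grid, grid_sample

    class SpatialTransformerHead(nn.Module):
        """Jaderberg et al. style: predict the 6 parameters of an affine transform."""
        def __init__(self, feature_dim):
            super().__init__()
            self.fc = nn.Linear(feature_dim, 6)

        def forward(self, features, image):
            n, c, h, w = image.shape
            theta = self.fc(features).view(n, 2, 3)     # learned transformation parameters
            grid = affine_grid(theta, [n, c, h, w], align_corners=False)
            return grid_sample(image, grid, align_corners=False)

    class AppearanceFlowHead(nn.Module):
        """Zhou et al. style: predict the dense flow field (2 values per pixel) directly."""
        def __init__(self, feature_channels):
            super().__init__()
            self.conv = nn.Conv2d(feature_channels, 2, kernel_size=1)

        def forward(self, feature_map, image):
            flow = self.conv(feature_map).permute(0, 2, 3, 1)   # learned flow field (n, h, w, 2)
            return grid_sample(image, flow, align_corners=False)

    # Hypothetical usage: a pooled feature vector for the first head,
    # a spatial feature map for the second.
    image = torch.rand(2, 3, 64, 64)
    stn_out = SpatialTransformerHead(128)(torch.rand(2, 128), image)
    flow_out = AppearanceFlowHead(32)(torch.rand(2, 32, 64, 64), image)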

My snippet implements Zhou et al.'s algorithm because it also predicts the whole flow field: that is the "grid" variable. The way it computes the grid variable is not direct, and this might have caused your confusion. It starts with the "base_grid," which represents a flow field that copies every pixel to its own location. The base_grid is created by calling affine_grid with a fixed transformation (the identity). The fact that this transformation is not learned means that my snippet does not implement a Spatial Transformer Network.

The main prediction happens when the snippet computes the grid_change variable on the first line. The grid_change acts as an offset to the base_grid, so, after the two are added together, the whole flow field is obtained. I did this because many pixels in the input image (especially those of the character's body) remain unchanged, so it is easier for the network to learn the offsets than to create the whole flow field from scratch.
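
As a quick sanity check of this design choice, the following standalone sketch (not repository code; shapes and tolerance are arbitrary) shows that a zero grid_change leaves the image untouched: base_grid alone is the identity flow field, so the network only has to learn where the offsets should be non-zero.

    # Standalone check (not repository code): with a zero grid_change, base_grid
    # is the identity flow field, so resampling reproduces the input image.
    import torch
    from torch.nn.functional import affine_grid, grid_sample

    n, c, h, w = 1, 3, 32, 32
    image = torch.rand(n, c, h, w)

    identity = torch.Tensor([[1, 0, 0], [0, 1, 0]]).unsqueeze(0).repeat(n, 1, 1)
    base_grid = affine_grid(identity, [n, c, h, w], align_corners=False)
    grid_change = torch.zeros(n, h, w, 2)        # "no motion" offsets
    grid = base_grid + grid_change

    resampled = grid_sample(image, grid, mode='bilinear',
                            padding_mode='border', align_corners=False)
    print(torch.allclose(resampled, image, atol=1e-5))   # True: an identity warp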

I hope this answers your questions.