shrubb / latent-pose-reenactment

The authors' implementation of the "Neural Head Reenactment with Latent Pose Descriptors" (CVPR 2020) paper.
https://shrubb.github.io/research/latent-pose-reenactment/
Apache License 2.0

Could you provide some missing files and double-check the meta-model checkpoint? #1

Closed: kenmbkr closed this issue 3 years ago

kenmbkr commented 3 years ago

The README mentions the argument --config finetuning-base in the fine-tuning step and a training configuration configs/default.yaml in the training step. I suppose the configs/ directory was not committed.

The preprocessing script uses a file inference_folder.py for Graphonomy; is it a custom script modified from the original inference.py? If so, could you provide it?

Without the finetuning-base config, I manually added the --finetune argument for fine-tuning but encountered the following errors. How should I resolve them?

PID 10490 - 2020-10-29 11:14:39,776 - INFO - utils.load_config_file - Using config configs/finetuning-base.yaml
PID 10490 - 2020-10-29 11:14:39,776 - WARNING - utils.get_args_and_modules - Could not load config finetuning-base
PID 10490 - 2020-10-29 11:14:39,776 - INFO - utils.get_args_and_modules - Loading checkpoint file checkpoints/latent-pose-release.pth
PID 10490 - 2020-10-29 11:14:41,345 - INFO - utils.setup - Random Seed: 123
PID 10490 - 2020-10-29 11:14:42,995 - INFO - train.py - Initialized the process group, my rank is 0
PID 10490 - 2020-10-29 11:14:42,995 - WARNING - train.py - Sorry, multi-GPU fine-tuning is NYI, setting `--num_gpus=1`
PID 10490 - 2020-10-29 11:14:42,995 - INFO - train.py - Loading dataloader 'voxceleb2_segmentation_nolandmarks'
PID 10490 - 2020-10-29 11:14:43,133 - INFO - dataloaders.common.voxceleb.get_part_data (train) - Determining the 'train' data source
PID 10490 - 2020-10-29 11:14:43,133 - INFO - dataloaders.common.voxceleb.get_part_data (train) - Checking if '/path/latent-pose-reenactment/data/VoxCeleb1_test_finetuning/images-cropped/id10280/XiKRlssBw2M/000330#001148.mp4' is a directory...
PID 10490 - 2020-10-29 11:14:43,133 - INFO - dataloaders.common.voxceleb.get_part_data (train) - Yes, it is; the only train identity will be 'id10280/XiKRlssBw2M/000330#001148.mp4'
PID 10490 - 2020-10-29 11:14:43,138 - INFO - dataloaders.common.voxceleb.get_part_data (train) - This dataset has 818 images
PID 10490 - 2020-10-29 11:14:43,139 - INFO - dataloaders.common.voxceleb.get_part_data (train) - Setting `args.num_labels` to 1 because we are fine-tuning or the model has been fine-tuned
PID 10490 - 2020-10-29 11:14:43,148 - WARNING - dataloader - Could not find the '.npy' file with bboxes, will assume the images are already cropped
PID 10490 - 2020-10-29 11:14:43,148 - INFO - dataloaders.augmentation - Pixelwise augmentation: True
PID 10490 - 2020-10-29 11:14:43,148 - INFO - dataloaders.augmentation - Affine scale augmentation: True
PID 10490 - 2020-10-29 11:14:43,148 - INFO - dataloaders.augmentation - Affine shift augmentation: True
PID 10490 - 2020-10-29 11:14:43,160 - INFO - dataloaders.dataloader - This process will receive a dataset with 409 samples
PID 10490 - 2020-10-29 11:14:43,160 - INFO - train.py - Starting from checkpoint checkpoints/latent-pose-release.pth
PID 10490 - 2020-10-29 11:14:43,160 - INFO - utils.load_model_from_checkpoint - Loading embedder 'unsupervised_pose_separate_embResNeXt_segmentation'
PID 10490 - 2020-10-29 11:14:44,027 - INFO - utils.load_model_from_checkpoint - Loading generator 'vector_pose_unsupervised_segmentation_noBottleneck'
PID 10490 - 2020-10-29 11:14:44,552 - INFO - utils.load_model_from_checkpoint - Loading discriminator 'no_landmarks'
PID 10490 - 2020-10-29 11:14:45,501 - WARNING - utils.load_model_from_checkpoint - Discriminator has changed in config (maybe due to finetuning), so not loading `optimizer_D`
PID 10490 - 2020-10-29 11:14:45,501 - INFO - utils.load_model_from_checkpoint - Loading runner holycow
PID 10490 - 2020-10-29 11:14:45,502 - WARNING - utils.load_model_from_checkpoint - Embedder or generator has changed in config, so not loading `optimizer_G`
PID 10490 - 2020-10-29 11:14:45,503 - INFO - train.py - Starting from iteration #2714183
PID 10490 - 2020-10-29 11:14:50,379 - WARNING - runner - Parameters mismatch in generator and the initial value of weights' running averages. Initializing by cloning
PID 10490 - 2020-10-29 11:14:50,386 - INFO - train.py - For fine-tuning, computing an averaged identity embedding from 409 frames
PID 10490 - 2020-10-29 11:14:53,639 - INFO - train.py - Entering training loop
/path/anaconda3/envs/latent-pose/lib/python3.7/site-packages/torch/nn/functional.py:3385: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
  warnings.warn("Default grid_sample and affine_grid behavior has changed "
Traceback (most recent call last):
  File "train.py", line 291, in <module>
    epoch, args, phase='train', writer=writer, saver=saver)
  File "/path/Documents/latent-pose-reenactment/runners/holycow.py", line 230, in run_epoch
    all_data_dict, losses_G_dict, losses_D_dict = training_module(data_dict, target_dict)
  File "/path/anaconda3/envs/latent-pose/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/path/latent-pose-reenactment/runners/holycow.py", line 178, in forward
    crit_out = criterion(data_dict)
  File "/path/anaconda3/envs/latent-pose/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/path/latent-pose-reenactment/criterions/dis_embed.py", line 22, in forward
    fake_embed = data_dict['embeds_elemwise']
KeyError: 'embeds_elemwise'
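
(For reference, the crash above is a plain dictionary lookup in criterions/dis_embed.py. The sketch below is hypothetical, not the repository's code; it only illustrates the failure mode, i.e. the dis_embed criterion being enabled while nothing in the assembled model populates 'embeds_elemwise'.)

# Sketch of a defensive version of the failing lookup (hypothetical helper):
# surface the root cause instead of a bare KeyError when the module that
# fills 'embeds_elemwise' was never wired up by the missing fine-tuning config.
def get_fake_embed(data_dict):
    if 'embeds_elemwise' not in data_dict:
        raise KeyError(
            "data_dict lacks 'embeds_elemwise'; the dis_embed criterion "
            "expects an embedding discriminator to produce it")
    return data_dict['embeds_elemwise']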
shrubb commented 3 years ago

Hi, and thank you for letting me know. Indeed, I forgot to add configs/ and Graphonomy because they were .gitignored.

Can you git pull and try again?

kenmbkr commented 3 years ago

I pulled the latest code and fine-tuning now works for me. However, I am unable to test segmentation with Graphonomy because the universal trained model is unavailable. I have opened an issue there but did not get a response. Please share the universal trained model if that is okay with you.

I tested face reenactment but I am unable to reproduce the results from the paper. I used the provided script for preprocessing. Even though my video is already 256x256, the script still adds FFHQ-style reflection padding to the images-cropped results. When the cropped images have reflection padding, I got the following result. If I put the original images in the images-cropped folder, I got the following result.

Could you kindly provide image examples after preprocessing, or detailed instructions on how to reproduce the results as in the paper?

shrubb commented 3 years ago

It's a shame that Graphonomy's link has expired; my bad for not checking that, sorry. I've uploaded the checkpoint myself and updated the instructions.

As for the preprocessing, the provided checkpoint is sensitive to the proper cropping of both identity sources and drivers! utils/preprocess_dataset.sh (or, more precisely, utils/crop_as_in_dataset.py) does just that, and it also applies reflection padding when necessary, so yes, that padding is expected. In your example the driver wasn't cropped, so make sure to follow all steps of the instructions, and please report back whether it worked.
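
For illustration, here is a minimal sketch of this kind of crop-plus-reflection-padding. This is not the repository's actual code; the function name and the box coordinates are hypothetical.

import cv2  # pip install opencv-python

# Crop a face box that may overshoot the image borders, padding the
# overshoot by reflection, in the spirit of utils/crop_as_in_dataset.py.
def crop_with_reflection(image, x0, y0, x1, y1):
    h, w = image.shape[:2]
    pad_top, pad_left = max(0, -y0), max(0, -x0)
    pad_bottom, pad_right = max(0, y1 - h), max(0, x1 - w)
    padded = cv2.copyMakeBorder(image, pad_top, pad_bottom, pad_left, pad_right,
                                cv2.BORDER_REFLECT_101)
    # Shift the box into the padded image's coordinates before slicing.
    return padded[y0 + pad_top : y1 + pad_top, x0 + pad_left : x1 + pad_left]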

kenmbkr commented 3 years ago

I did reenactment with the provided Graphonomy checkpoint and got the following result. There seem to be segmentation or reflection padding artifacts in every frame. In the paper all figures except Figure 5 have no reflection padding.

For preprocessing, I used the following options.

DO_DECODE_VIDEOS=true

DO_CROP=true
DO_COMPUTE_SEGMENTATION=false
DO_COMPUTE_LANDMARKS=false
DO_COMPUTE_POSE_3DMM=false

DO_CROP_FFHQ=false
DO_COMPUTE_SEGMENTATION_FFHQ=false

For fine-tuning, I used 230 iterations for one identity image, as suggested. I am not sure whether NUM_IMAGES=5 refers to how many images there are, the indexes of the images, or something else, so I just left it as it is. The instructions mention underfitting and overfitting. I can see in Tensorboard that the losses are decreasing, but I have no way to tell whether that is a good fit. It would be great if you could provide some rough numbers on how low the losses should be.

(attached screenshot: Screenshot_20201030_161551)

Please kindly let me know what else I may have missed in reproducing the results.

shrubb commented 3 years ago

I did reenactment with the provided Graphonomy checkpoint and got the following result.

I can confirm that your result is valid; it looks reasonable.

There seem to be segmentation or reflection padding artifacts in every frame.

The reflection padding is expected if the face is cropped too tightly in the original images. If that happens for the driver, it's totally fine; if that happens for the identity, segmentation quality may suffer (like in your case) or at least the blurred reflection will be visible in the reenactment output.

To improve your particular result, you can get identity image(s) with a wider crop or of better quality, or, in a pinch, improve the masks in $DATASET_ROOT/segmentation-cropped/$IDENTITY_NAME manually (via image editing software).
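
If hand-editing many masks is tedious, a rough batch clean-up can also help. The sketch below is assumption-laden: the 127 threshold, the 7x7 structuring element, and the PNG extension are arbitrary starting points, not values from the repository.

import os
from pathlib import Path

import numpy as np
import scipy.ndimage as ndi
from PIL import Image

# Binarize each mask and close small holes, overwriting in place.
mask_dir = Path(os.path.expandvars("$DATASET_ROOT/segmentation-cropped/$IDENTITY_NAME"))
for mask_path in sorted(mask_dir.glob("*.png")):
    mask = np.array(Image.open(mask_path).convert("L")) > 127   # binarize
    mask = ndi.binary_closing(mask, structure=np.ones((7, 7)))  # fill pinholes
    Image.fromarray(mask.astype(np.uint8) * 255).save(mask_path)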

In the paper all figures except Figure 5 have no reflection padding.

Yes, because those images originally contained a larger region around the face. Though I have to admit that, in this repository, utils/preprocess_dataset.sh computes segmentation on already cropped (256x256) images. Such a "quick and dirty" solution is faster but isn't optimal in terms of quality. As I said above, if you really need good segmentation masks, you should compute them on full images yourself or edit them by hand. Or just feed in good-quality identity images (usually the case with original VoxCeleb images), which is often enough too.
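
As a concrete sketch of that "full images first" route (hypothetical throughout: segment() stands in for a Graphonomy wrapper, crop_with_reflection is the sketch from earlier in this thread, and x0..y1 is the same box used to crop the RGB frame):

import cv2

full = cv2.imread("frame_full.jpg")                # the uncropped source frame
mask = segment(full)                               # hypothetical Graphonomy call
mask = crop_with_reflection(mask, x0, y0, x1, y1)  # same crop as the RGB frame
mask = cv2.resize(mask, (256, 256), interpolation=cv2.INTER_NEAREST)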

I am not sure whether NUM_IMAGES=5 refers to how many images there are, the indexes of the images, or something else

It indicates how many images there are. Sorry for the confusion. I've just improved the script so you don't have to set it manually anymore.

I can see in Tensorboard that the losses are decreasing, but I have no way to tell whether that is a good fit.

The "IMAGES" tab should be the most useful monitor for you. Open it and see when the identity gap is low enough. I've also improved the README on that.

kenmbkr commented 3 years ago

Thank you for your detailed explanations. I selected this image as the identity, which should have enough space for cropping. I fine-tuned for 230 and 460 iterations; below are the Tensorboard images, followed by the reenactment results for 230 and 460 iterations.

I observed the following in the results:

- The mouth shape is ambiguous.
- The reflection paddings are still present.
- The identity gap is still large.

What else might I have missed in the reenactment?

(attached images: obama, individualImage, individualImage)

shrubb commented 3 years ago

The mouth shape is ambiguous.

Oh, I realized I made a silly error in the instructions; I've just corrected it. The number of iterations should scale with the number of images. So try fewer iterations, around 125.
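
Spelled out (the linear scaling here is an assumption; only the roughly-125-for-one-image figure comes from the correction above):

num_images = 1                     # a single identity image, as in your run
num_iterations = 125 * num_images  # assumed to scale linearly with the set size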

The reflection paddings are still present.

You could use a driving video that originally has more space around the face, though I doubt this will improve anything other than segmentation. Having more space in the identity image is more important.

The identity gap is still large.

There's not much I can suggest 🙁 There's a trade-off between identity gap and mimics. You could play with loss weights (for example, increase the weight of the facial identity loss), try other images or videos, or try a better driver. No guarantees.

kenmbkr commented 3 years ago

There's a trade-off between identity gap and mimics.

Could you shed light on why identity and pose have to be a trade-off but cannot be jointly optimized? Is this limited to the architecture of this paper (latent pose reenactment), or does the same limitation also occur in similar meta-learning models such as few-shot talking head or bi-layer one-shot?

shrubb commented 3 years ago

Could you shed light on why identity and pose have to be a trade-off but cannot be jointly optimized?

Because we fine-tune to an extremely small dataset (as small as 1 image). So, overfitting is almost inevitable and should be carefully handled (notably, by choosing the right number of iterations).
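
A sketch of how that careful handling might look in practice (every name below is hypothetical, and the snapshot interval is arbitrary): snapshot often and pick the best iteration by eye from the IMAGES tab.

max_iterations = 125              # per the earlier suggestion for one image
for it in range(max_iterations):
    train_step()                  # hypothetical: one fine-tuning step on the tiny set
    if it % 25 == 0:
        save_checkpoint(it)       # hypothetical: compare these snapshots visually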

does the same limitation also occur in similar meta-learning models such as few-shot talking head or bi-layer one-shot?