Training Results in Videos with Spring Animation

CodeBlo commented 2 years ago

Hi,

We have a dataset where a liquid flowing in the water from right to left. We are trying to generate similar videos using StyleGAN-V. But the produced videos are have a spring like animation, ie. at first video moves from right to left then left to right. For example the video starts with a nice motion from right to left but after some time it begins to go from left to right:

Trained Videos Merged: https://drive.google.com/file/d/1tk7y0Tb_nnpTN0NjWbTkVPia-a-MOINN/view?usp=sharing
Example Generated Video https://user-images.githubusercontent.com/55738066/171813197-cefedf40-9b7c-4802-a3d9-531dfb35d722.mp4
Our Config File: (GitHub did not accept yaml so I had to convert to txt :( ) experiment_config.txt

Will more training solve the issue or is there any optimization that we can do?

Thanks!

JCBrouwer commented 2 years ago

I've noticed the same thing in my own training runs as well. My first instinct was that it's related to mirroring the dataset, but it looks like you have that turned off!

All my videos are dominated by two modes of motion. A large scale left-to-right movement and an undulating, faster, up-and-down flashing movement.

I'm starting to think this is inherent to the current design of the motion encoder.

Back in February I tried cleaning up the research zip from #1 and got these results training the motion encoder from scratch: https://gfycat.com/gloriousgrizzledhadrosaurus (not exactly sure what the settings were, but I think my motion_z_distance was too short, leading to the extreme quick motions)

With the release of the official code I tried again starting from the pre-trained faces checkpoint: https://gfycat.com/generalquarterlykudu Config: https://pastebin.com/WqrygJMA The results are definitely smoother (probably because of the long motion_z_distance / the better starting point), but still this large scale left-right movement is very apparent in all of the videos.

The reason I think it might be inherent is that the same effect is in the pre-trained checkpoint which I started from, here's the video from the start of training with the unchanged faces checkpoint. https://gfycat.com/bouncyagonizingaustraliankestrel It also contains the same undulating, flashing, periodic motion!

The same effect is also clearly visible in the SkyTimelapse GIF in the README. Look at how all the clouds make a long movement right and then a long movement back to the left.

Would love to know if there is a way to change up the motion encoder (or anything else?) to reduce this effect!

(paging @universome, thank you for the amazing work by the way :)

skymanaditya1 commented 2 years ago

Hi,

We have a dataset where a liquid flowing in the water from right to left. We are trying to generate similar videos using StyleGAN-V. But the produced videos are have a spring like animation, ie. at first video moves from right to left then left to right. For example the video starts with a nice motion from right to left but after some time it begins to go from left to right:

Trained Videos Merged: https://drive.google.com/file/d/1tk7y0Tb_nnpTN0NjWbTkVPia-a-MOINN/view?usp=sharing

Example Generated Video https://user-images.githubusercontent.com/55738066/171813197-cefedf40-9b7c-4802-a3d9-531dfb35d722.mp4

Our Config File: (GitHub did not accept yaml so I had to convert to txt :( ) experiment_config.txt

Will more training solve the issue or is there any optimization that we can do?

Thanks!

I had faced a similar issue. I guess it could be because of the augmentations that you are using. In your config file, you have bgc as the aug_pipe which has different augmentations like rotation, flipping, etc. I guess that could be the reason for observing the motion in two different directions.

JCBrouwer commented 2 years ago

I had faced a similar issue. I guess it could be because of the augmentations that you are using. In your config file, you have bgc as the aug_pipe which has different augmentations like rotation, flipping, etc. I guess that could be the reason for observing the motion in two different directions.

In my case, at least, I have more 100k frames in the dataset so I'm quite confident there isn't any augmentation leakage. I've only ever seen that with very small datasets (<2000 imgs).

universome commented 2 years ago

Hi! To be honest, I believe that the issue you report does not seem to be easily fixable. I attribute it to the fact that the generator uses just a single 512-dimensional content code (w) while you are trying to generate an "infinite" amount of different content from it. But there are other factors at play as well.

To mitigate it, I would try are the following things:

I am somewhat surprised that the discriminator does not catch the reversed movement. It could be explainable if you trained with 2 frames per clip, but as I see you are using 3 of them. Did you try switching off horizontal flips (xflip=1 => xlip=0) in the augmentation pipe? In this case, the discriminator will be seeing just a single movement direction all the time during training, which could help.
Passing motion codes via modulation rather than concatenation (maybe even removing w codes completely). This can be done by setting cond_type to sum_w or concat_w in the config. In this way, it would be similar in spirit to having "infinite" amount of content codes.
Using spatio-temporal time embeddings specifically for your dataset. This means, that instead of concatenating time embeddings p_\theta(t) to the constant input tensor (which depend only on time), you concatenate p_w(t, x, y) — joint positional embeddings, similar to DIGAN. Also, the role of the mapping network is somewhat unclear for your dataset since all the videos are of the same style.
Processing the frame features in the discriminator in a more sophisticated way, rather than just concatenation and using more frames per clip. For example, with rnns/temporal convs/attention and using absolute time positional embeddings for them. It should help the discriminator to catch the reversed movement.
Replacing StyleGAN2 generator backbone with the StyleGAN3 backbone. I have a suspicion that it is difficult for the SG2 generator to "draw" content wherever it wants due to 3x3 convolutions (which entangles pixels together) and not having positional embeddings, which makes it more difficult for it to "execute" movement in general. Using SG3 will result in having spatio-temporal positional embeddings for pixels.
Adapting a scheme similar to ALIS, where we "merge" an infinite amount of content codes together. But it's not clear, how to "move" them.

Is the dataset you are using available publicly?

JCBrouwer commented 2 years ago

Thanks for the in-depth response @universome !

I'll definitely have a look at some of your suggestions. It seems to me maybe it also makes sense to supply the w-code to the motion encoder. Some motions might only be valid for certain styles and not for others, but now the motion encoder does not have this information.

Have you seen Generating Long Videos of Dynamic Scenes? Looks very promising! Of course they're using much more compute because they work with dense spatio-temporal representations all the way through. Perhaps still some of their temporal-coherency-focused ideas can be ported over into the motion encoder here for gains.

universome / stylegan-v

Training Results in Videos with Spring Animation #19