universome / stylegan-v

[CVPR 2022] StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2
https://universome.github.io/stylegan-v

Training Results in Videos with Spring Animation #19

Open CodeBlo opened 2 years ago

CodeBlo commented 2 years ago

Hi,

We have a dataset in which a liquid flows through water from right to left. We are trying to generate similar videos using StyleGAN-V, but the produced videos have a spring-like animation, i.e., at first the video moves from right to left, then from left to right. For example, the video starts with a nice motion from right to left, but after some time it begins to go from left to right.

Will more training solve the issue, or is there any optimization we can do?

Thanks!

JCBrouwer commented 2 years ago

I've noticed the same thing in my own training runs as well. My first instinct was that it's related to mirroring the dataset, but it looks like you have that turned off!

All my videos are dominated by two modes of motion: a large-scale left-to-right movement and a faster, undulating, up-and-down flashing movement.

I'm starting to think this is inherent to the current design of the motion encoder.

Back in February I tried cleaning up the research zip from #1 and got these results training the motion encoder from scratch: https://gfycat.com/gloriousgrizzledhadrosaurus (I'm not exactly sure what the settings were, but I think my motion_z_distance was too short, leading to the extremely quick motions).

With the release of the official code I tried again, starting from the pre-trained faces checkpoint: https://gfycat.com/generalquarterlykudu Config: https://pastebin.com/WqrygJMA The results are definitely smoother (probably because of the long motion_z_distance and the better starting point), but this large-scale left-right movement is still very apparent in all of the videos.
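For anyone else tuning this: my rough mental model (a toy sketch, not the repo's actual implementation) is that motion codes are anchored every motion_z_distance frames and interpolated in between, so a short distance forces the trajectory to turn around more often:

```python
import numpy as np

def toy_motion_trajectory(num_frames, motion_z_distance, dim=512, seed=0):
    """Toy illustration only (not StyleGAN-V's code): sample i.i.d. motion
    anchors every `motion_z_distance` frames and linearly interpolate
    between them. Smaller distances mean more frequent direction changes,
    i.e. faster, jerkier motion."""
    rng = np.random.default_rng(seed)
    anchors = rng.standard_normal((num_frames // motion_z_distance + 2, dim))
    codes = []
    for t in range(num_frames):
        i, frac = divmod(t, motion_z_distance)
        alpha = frac / motion_z_distance
        codes.append((1 - alpha) * anchors[i] + alpha * anchors[i + 1])
    return np.stack(codes)  # (num_frames, dim)

# motion_z_distance=8 turns around every 8 frames; 256 barely changes
# direction within a 64-frame clip.
fast = toy_motion_trajectory(64, 8)
slow = toy_motion_trajectory(64, 256)
```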

The reason I think it might be inherent is that the same effect is present in the pre-trained checkpoint I started from. Here's a video from the start of training with the unchanged faces checkpoint: https://gfycat.com/bouncyagonizingaustraliankestrel It also contains the same undulating, flashing, periodic motion!

The same effect is also clearly visible in the SkyTimelapse GIF in the README. Look at how all the clouds make a long movement to the right and then a long movement back to the left.

Would love to know if there is a way to change up the motion encoder (or anything else?) to reduce this effect!

(paging @universome; thank you for the amazing work by the way! :)

skymanaditya1 commented 2 years ago

> Hi,
>
> We have a dataset in which a liquid flows through water from right to left. We are trying to generate similar videos using StyleGAN-V, but the produced videos have a spring-like animation, i.e., at first the video moves from right to left, then from left to right. For example, the video starts with a nice motion from right to left, but after some time it begins to go from left to right.
>
> Will more training solve the issue, or is there any optimization we can do?
>
> Thanks!

I faced a similar issue. I suspect it could be because of the augmentations you are using. In your config file, you have bgc as the aug_pipe, which includes augmentations like rotation and flipping. That could be the reason you observe motion in two different directions.
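If that is the cause, one option would be a pipeline without geometric transforms. A sketch of the relevant specs, assuming StyleGAN-V reuses stylegan2-ada-pytorch's augpipe definitions (please verify against this repo's own configs):

```python
# Augmentation specs as defined in stylegan2-ada-pytorch's train.py
# (assumed to carry over to StyleGAN-V; verify locally).
augpipe_specs = {
    # 'bgc' = blit + geometric + color. The blit/geometric parts
    # (xflip, rotate90, rotate, ...) are the ones that could leak
    # direction-flipped motion into the generator.
    'bgc': dict(xflip=1, rotate90=1, xint=1, scale=1, rotate=1, aniso=1,
                xfrac=1, brightness=1, contrast=1, lumaflip=1, hue=1,
                saturation=1),
    # Color-only alternative: keeps photometric augmentations but never
    # shows the discriminator mirrored or rotated frames.
    'color': dict(brightness=1, contrast=1, lumaflip=1, hue=1, saturation=1),
}
```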

JCBrouwer commented 2 years ago

> I faced a similar issue. I suspect it could be because of the augmentations you are using. In your config file, you have bgc as the aug_pipe, which includes augmentations like rotation and flipping. That could be the reason you observe motion in two different directions.

In my case, at least, I have more than 100k frames in the dataset, so I'm quite confident there isn't any augmentation leakage. I've only ever seen that with very small datasets (<2000 imgs).

universome commented 2 years ago

Hi! To be honest, I do not think the issue you report is easily fixable. I attribute it to the fact that the generator uses just a single 512-dimensional content code (w) while you are trying to generate an "infinite" amount of different content from it. But there are other factors at play as well.
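Schematically, the situation is something like the sketch below (simplified pseudocode with assumed names, not our actual implementation): every frame is produced from the same content code, so all long-range content changes must be squeezed through a single fixed 512-dimensional w.

```python
import torch

def synthesize_video_sketch(mapping, synthesis, z_content, motion_codes):
    """Simplified sketch of the conditioning scheme (assumed names, not
    this repo's API). One content latent is mapped to a single w that is
    reused for every frame; only the motion codes vary over time."""
    w = mapping(z_content)  # one 512-d content code for the whole video
    frames = [synthesis(w, m_t) for m_t in motion_codes]  # per-frame motion
    return torch.stack(frames)  # (T, C, H, W)
```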

To mitigate it, I would try the following things:

Is the dataset you are using available publicly?

JCBrouwer commented 2 years ago

Thanks for the in-depth response, @universome!

I'll definitely have a look at some of your suggestions. It seems to me that it might also make sense to supply the w-code to the motion encoder. Some motions might only be valid for certain styles and not for others, but currently the motion encoder does not have this information.
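Concretely, I'm imagining something like this (a hypothetical modification, not code from this repo): condition the motion mapping on w by simple concatenation.

```python
import torch
import torch.nn as nn

class ContentConditionedMotionMapper(nn.Module):
    """Hypothetical sketch: let the motion mapping see the content code w,
    so that the predicted motions can depend on the video's style."""
    def __init__(self, motion_dim=512, w_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + w_dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, motion_z, w):
        # Concatenate the per-frame motion code with the frame-invariant
        # content code before mapping.
        return self.net(torch.cat([motion_z, w], dim=-1))
```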

Have you seen "Generating Long Videos of Dynamic Scenes"? It looks very promising! Of course, they use much more compute because they work with dense spatio-temporal representations all the way through. Perhaps some of their temporal-coherency-focused ideas could still be ported over into the motion encoder here for gains.