sherwinbahmani / 4dfy

4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling
https://sherwinbahmani.github.io/4dfy/
Apache License 2.0

More motion #9

Closed · ByChelsea closed this issue 2 months ago

ByChelsea commented 4 months ago

Hi, following another issue, I increased system.loss.lambda_sds_video towards 1.0. Based on the results displayed in the gallery, "a dog riding a skateboard" doesn't seem to have much motion, so I tried "a panda dancing" instead.

When training for 50,000 epochs with system.loss.lambda_sds_video=0.1, I got:

https://github.com/sherwinbahmani/4dfy/assets/67221921/af148f19-03ad-49e5-9581-a957c921904b

When training for 50,000 epochs with system.loss.lambda_sds_video=1.0, I got:

https://github.com/sherwinbahmani/4dfy/assets/67221921/8389ef21-1dbd-4c75-b985-b38065b22e39

And I found that when training for 10,000 epochs with system.loss.lambda_sds_video=0.1, I got:

https://github.com/sherwinbahmani/4dfy/assets/67221921/c56a434b-3eed-4f41-8404-59bba98304c3

When training for 10,000 epochs with system.loss.lambda_sds_video=1.0, I got:

https://github.com/sherwinbahmani/4dfy/assets/67221921/63dcec1d-d7f1-40ed-aa56-c8e3b60022cd

Thus, I guess that with a small system.loss.lambda_sds_video, longer training is needed to obtain more motion. However, once training is long enough, the impact of system.loss.lambda_sds_video no longer seems significant?
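
For reference, a run with a given video-guidance weight can be launched roughly like this. This is a minimal sketch assuming 4D-fy's threestudio-style launch.py, where config values are overridden as key=value arguments; the config file name is illustrative, and later stages may also need weights from the previous stage.

```bash
# Minimal sketch (assumptions): threestudio-style launcher with key=value overrides;
# the config file name is illustrative, check the repo for the actual stage configs.
python launch.py --config configs/fourdfy_stage_3.yaml --train --gpu 0 \
    system.prompt_processor.prompt="a panda dancing" \
    system.loss.lambda_sds_video=0.1  # or 1.0 for the higher video-guidance weight
```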

sherwinbahmani commented 4 months ago

Hi,

Yes, for the case you are showing it does not seem to make a huge difference, but I remember it making a difference for some prompts when I tested it. I am currently experimenting with VideoCrafter2 as video model guidance, which should improve motion a lot. I will get back to you as soon as I have first results.

I still think one bottleneck is the 4D representation; a better, disentangled approach could help there instead of using the two hash grids.

ByChelsea commented 4 months ago

Thanks a lot for the reply, and I'm looking forward to the results!

Additionally, what do you think are the limitations of 4D representations? Is it that they cannot capture more motion? AYG is a disentangled approach, but I don't see its motion being significantly greater than 4D-fy's, and their method also includes "extended autoregressive generation" (which extends their dynamic time) and "motion amplification", both of which could help increase motion. Of course, I'm not suggesting that hash grids are the best representation; I'm just considering these aspects in relation to the question of motion.

I hope I've expressed this clearly. Thank you again for the discussion!

sherwinbahmani commented 4 months ago

I think the limitation of the representation is that you often have to constrain the motion too much. That is why we use alternating optimization with image diffusion models: it constrains the motion and keeps quality higher. Without any image guidance (you can test this by setting prob_multi_view=0.0 and prob_single_view_video=1.0), the motion will be very high and it will converge within 10,000 iterations, but then the quality goes down for our non-disentangled approach. For AYG, they have to regularize the disentangled approach with losses, which in the end also limits the amount of motion. So coming up with a representation and losses that don't need heavy regularization would give much higher motion. A sketch of that no-image-guidance test is shown below.
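
A minimal sketch of that test, assuming the same threestudio-style command-line overrides as above and that these probabilities live under the data section of the config (the exact key prefix and config name may differ, so check the actual YAML):

```bash
# Minimal sketch (assumptions): the data. prefix and config file name may differ.
# Setting prob_multi_view=0.0 and prob_single_view_video=1.0 drops the image-based
# guidance and trains with video SDS only, which should show much more motion at
# the cost of quality.
python launch.py --config configs/fourdfy_stage_3.yaml --train --gpu 0 \
    system.prompt_processor.prompt="a panda dancing" \
    data.prob_multi_view=0.0 \
    data.prob_single_view_video=1.0
```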

On the other hand, the text-to-video models themselves don't produce very high motion, but rather camera motion in their videos. NVIDIA has a much better internal video model than what is publicly available right now, so they can get more motion. But video models are improving, and plugging in new ones should boost the amount of motion orthogonally.

ByChelsea commented 4 months ago

That makes sense, and thanks for your insight! I'll run some tests to explore further.

sherwinbahmani commented 2 months ago

I have now included VideoCrafter2 video guidance for potentially more motion, as well as a deformation-based approach. Feel free to open another issue.