nateraw / stable-diffusion-videos

Create 🔥 videos with Stable Diffusion by exploring the latent space and morphing between text prompts
Apache License 2.0
4.43k stars, 422 forks

Add depth model from MiDas for getting depth tensor to create 3D animations #53

Open danielpatrickhug opened 2 years ago

danielpatrickhug commented 2 years ago

Add functionality for calculating depth tensors, as seen here: https://github.com/deforum/stable-diffusion/blob/main/helpers/depth.py. Affine geometric transformation functions (translation, rotation, scaling) would also be a cool addition and project.
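To make the affine part concrete, here is a minimal sketch of what such helpers could look like, using 3x3 homogeneous matrices in plain Python. All function names here are hypothetical; none of them come from deforum or from this repo.

```python
import math

# Hypothetical helpers (not from deforum or stable-diffusion-videos).
# Each returns a 3x3 homogeneous matrix as a nested list.

def translation(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

def rotation(degrees):
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def scaling(sx, sy):
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]

def compose(*matrices):
    """Multiply matrices left to right; the rightmost one acts on a point first."""
    result = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for m in matrices:
        result = [
            [sum(result[i][k] * m[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)
        ]
    return result

def apply(matrix, x, y):
    """Transform the 2D point (x, y) with a 3x3 homogeneous matrix."""
    px = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    py = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    return px, py

# Scale by 2, rotate by 90 degrees, then shift 10 units along x.
warp = compose(translation(10, 0), rotation(90), scaling(2, 2))
print(apply(warp, 1, 0))
```

Composing everything into one matrix means a frame would only need to be resampled once per step, whatever the mix of movements.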

danielpatrickhug commented 2 years ago

Credit to the Deforum community and others. Usage is shown here: https://colab.research.google.com/github/deforum/stable-diffusion/blob/main/Deforum_Stable_Diffusion.ipynb

video explanation: https://www.youtube.com/watch?v=F1bk9OXOmow

Spaces example: https://huggingface.co/spaces/akhaliq/DPT-Large. This would be cool with textual inversion.

this may be a mountain of a request.

0x1355 commented 2 years ago

I am trying out deforum. Will take a look at this.

Seems an epic. But maybe we can do this piece by piece.

danielpatrickhug commented 2 years ago

@0x1355 Hi! potentially we can start with a 2d transformation, like perpetual outpainting in the y direction.

nateraw commented 2 years ago

I wonder how much these transformations will affect the quality of the interpolation videos, though? I don't really know what the difference is between what deforum is doing for frame interpolation and what we are doing here. I am more of a fan of the interpolations happening here, as they look "cleaner + smoother" to me (but I'm probably biased😅)

If we were to add something like this, would it be doing something different for the community than what deforum has already done?

danielpatrickhug commented 2 years ago

You're right, they may not be as clean as they are now, at least at first. From what I gathered, the part of the deforum code that handles the 3D transformation came from the Disco Diffusion library and https://twitter.com/gandamu_ml, as cited here.

I thought it would be good to rewrite some of the ideas in a more explicit fashion, like in a pipeline, as it took me a long time to understand what was going on in that notebook. But a lot of the ideas are cool and could be expanded on. Also, a Python package format would be good for the sake of modularity and redundancy, and maybe we get lucky and improve or learn something new :).

0x1355 commented 2 years ago

Back from deforum land. Will put my thoughts in words tomorrow.

0x1355 commented 2 years ago

TL;DR Low impact. High effort. Personally I would prioritize something else for now.

Low impact

deforum has 2D, 3D, video_input, and interpolation animation modes. In comparison, sd-videos only does interpolation at the moment, but it is smoother due to a different implementation. See this example:

https://user-images.githubusercontent.com/4979897/193022967-14d7b407-6495-44bd-ab2e-d4cc6f91e980.mp4

Adding 3D animation, at least if we do it the same way as deforum, will result in a 3D mode that is, at best, as good as deforum. This doesn't add much value to sd-videos users. They can just use deforum for that.

High effort

What if we do it differently? Possible, but not easy.

deforum 2D/3D animation mode does the following:

  1. Render a key frame
  2. Transform the frame according to animation movement settings
  3. Use transformed frame as initial image for the next key frame, with adjusted number of steps
  4. Create tween frames by blending the two key frames (NOT interpolating)
  5. Repeat
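The five steps above can be sketched roughly as follows. This is my reading of the loop, not deforum's actual code: `render` and `transform` are hypothetical callables standing in for the diffusion call and the 2D/3D warp, frames are simplified to flat lists of floats, and `init_steps` is an assumed name for the "adjusted number of steps".

```python
def blend(a, b, t):
    """Pixel-wise linear cross-fade between two frames (flat lists of floats)."""
    return [(1.0 - t) * pa + t * pb for pa, pb in zip(a, b)]

def animate(render, transform, num_key_frames, tweens_per_gap, steps, init_steps):
    """Yield frames in playback order.

    render(init_image, steps) and transform(frame) are hypothetical stand-ins,
    not real deforum functions.
    """
    prev_key = render(None, steps)                    # 1. render a key frame
    yield prev_key
    for _ in range(num_key_frames - 1):               # 5. repeat
        warped = transform(prev_key)                  # 2. apply movement settings
        next_key = render(warped, init_steps)         # 3. warped frame as init image
        for i in range(1, tweens_per_gap + 1):        # 4. tweens by blending pixels
            yield blend(prev_key, next_key, i / (tweens_per_gap + 1))
        yield next_key
        prev_key = next_key

# Toy stand-ins so the loop can be run without a model.
def toy_render(init, steps):
    return [0.0, 0.0] if init is None else [v + 1.0 for v in init]

def toy_transform(frame):
    return list(frame)

frames = list(animate(toy_render, toy_transform, 3, 1, 50, 25))
print(len(frames))
```

Each key frame is generated from a warped copy of the previous one, so any error in a key frame is carried into the next one. The tween frames are plain pixel cross-fades and never go through the model.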

Reusing the previous frame as the initial image and blending both help with coherence. The flip side is that longer, slower-moving videos tend to degrade into artifacts such as lines and patterns, like at the end of this video:

https://user-images.githubusercontent.com/4979897/193010083-9cc8af48-a950-4906-8c2d-ff2f1c5dc91f.mp4

This issue has been on their radar for a while, but they haven't been able to find a better solution yet. This suggests high effort.

Prioritize something else

I use sd-videos to make longer and slower videos. The two most common issues I see are:

  1. Flickering
  2. Inconsistent pace of interpolation: little change for a while, then suddenly a lot going on. #52 can be used to solve this but not in a direct way for most users.

Personally I prefer to work on things like the above, where we can add more value.

That doesn't mean I don't like or want to work on a 3D mode. It's just a lower priority for me right now.

nateraw commented 2 years ago

Thank you so much yet again @0x1355 for your comprehensive deep dive into another issue here. I'm going to mark this as low priority/not going to solve for now.