@nicollegah, actually you could also try to make use of the depth2image pipeline and pass a custom depth mask (https://github.com/huggingface/diffusers/blob/fc8afa3ab5eb840ab0da5aadb629bf671eef9a39/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py#L429), which would suit your use case. In this sense, depth2image is a combination of img2img and inpainting.
Maybe this could help? :-)
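Something along these lines, as a rough, untested sketch (the ramp below is only a placeholder for your own mask or depth estimate, and the exact shape expected for `depth_map` may vary slightly between diffusers versions, so check `prepare_depth_map` in your install):

```python
# Rough sketch (untested): pass your own depth map to depth2img so the
# geometry you care about stays anchored while the rest is free to change.
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("frame_0001.png").convert("RGB")  # hypothetical frame
h, w = init_image.height, init_image.width

# Placeholder "depth": a left-to-right ramp, shape (batch, H, W); in practice
# you would build this from your own mask or depth estimate.
custom_depth = torch.linspace(0.0, 1.0, w).repeat(1, h, 1)

result = pipe(
    prompt="a painting of the same scene",
    image=init_image,
    depth_map=custom_depth,
    strength=0.7,
).images[0]
result.save("frame_0001_out.png")
```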
I am experimenting with something similar right now. If you are amenable to external tools, it's quite easy to generate a "motion mask" in vapoursynth so that unchanged parts of the scene remain static across frames. There is a simple motionmask plugin, and a more involved way to do it with mvtools.
There are additional postprocessing plugins like reduceflicker or temporalsoften that also help with coherence.
I am also working on a modification of the img2img pipeline that takes two frames instead of one (the current frame, and the previous, processed, motion-compensated frame) and merges them in latent space before feeding them to SD, to try to get some temporal coherence out of it.
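Roughly what I mean, as an untested sketch (the model name, blend weight, strength and file names are just placeholders; it assumes a recent diffusers where img2img takes an `image` argument):

```python
# Untested sketch of the two-frame latent merge: encode the current frame and
# the previous processed (motion-compensated) frame with the VAE, blend the
# latents, decode, and hand the blend to img2img.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def to_tensor(img: Image.Image) -> torch.Tensor:
    # PIL RGB -> (1, 3, H, W) in [-1, 1], which is what the VAE expects
    arr = np.asarray(img.convert("RGB"), dtype=np.float32) / 255.0
    t = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0)
    return (2.0 * t - 1.0).to("cuda", dtype=torch.float16)

@torch.no_grad()
def encode(img: Image.Image) -> torch.Tensor:
    return pipe.vae.encode(to_tensor(img)).latent_dist.sample() * 0.18215

@torch.no_grad()
def decode(latents: torch.Tensor) -> Image.Image:
    img = pipe.vae.decode(latents / 0.18215).sample
    img = ((img / 2 + 0.5).clamp(0, 1) * 255).byte()[0].permute(1, 2, 0).cpu().numpy()
    return Image.fromarray(img)

curr_frame = Image.open("frame_0002.png")            # current input frame
prev_frame = Image.open("frame_0001_processed.png")  # previous output, motion compensated

blended = decode(0.6 * encode(curr_frame) + 0.4 * encode(prev_frame))
out = pipe(prompt="same prompt every frame", image=blended, strength=0.4).images[0]
```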
PS: it would be nice if the img2img pipeline took a "latents" input like txt2img does, to make generation more consistent between frames by feeding it the same latent noise.
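In the meantime, a workaround is to reseed a `torch.Generator` before every call, so each frame at least starts from identical initial noise (sketch; `frames` is assumed to be your list of decoded PIL frames):

```python
# Workaround sketch: a fixed-seed generator gives the same initial noise on
# every img2img call, which keeps frames more consistent with each other.
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

outputs = []
for frame in frames:  # your decoded video frames (assumed to exist)
    generator = torch.Generator(device="cuda").manual_seed(1234)  # reseed each frame
    out = pipe(prompt="same prompt for every frame", image=frame,
               strength=0.5, generator=generator).images[0]
    outputs.append(out)
```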
> If you are amenable to external tools, it's quite easy to generate a "motion mask" in vapoursynth so that unchanged parts of the scene remain static across frames. There is a simple motionmask plugin, and a more involved way to do it with mvtools. There are additional postprocessing plugins like reduceflicker or temporalsoften that also help with coherence.
>
> I am also working on a modification of the img2img pipeline that takes two frames instead of one (the current frame, and the previous, processed, motion-compensated frame) and merges them in latent space before feeding them to SD, to try to get some temporal coherence out of it.
Thanks for your answer. All of this sounds very good; the last idea in particular sounds brilliant. I would be very keen on learning from you :-) I will look into mvtools, but I also wouldn't mind building the mask myself if there were just a way to apply it.
For coherence, Deforum adds noise in the individual steps, and the noise changes every iteration, but I don't really understand how or why this works.
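From skimming their code, the gist seems to be something like this (a very simplified sketch of my reading, not their actual implementation):

```python
# Each iteration, the previous output gets fresh random noise added before
# going back into img2img, so consecutive frames never see the exact same input.
import torch

def perturb(sample: torch.Tensor, noise_amount: float = 0.03) -> torch.Tensor:
    # new randn_like every call; noise_amount is a small scalar you tune
    return sample + noise_amount * torch.randn_like(sample)
```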
> @nicollegah, actually you could also try to make use of the depth2image pipeline and pass a custom depth mask, which would suit your use case. In this sense, depth2image is a combination of img2img and inpainting. Maybe this could help? :-)
Interesting idea. I will try it out.
Deforum is txt2vid (or img2vid?), so its strategy is very different from vid2vid.
If txt2vid/img2vid is what you are trying to do, then I think I misinterpreted you, and you are probably better off staying within PyTorch rather than jumping out to vapoursynth.
> @nicollegah, actually you could also try to make use of the depth2image pipeline and pass a custom depth mask, which would suit your use case. In this sense, depth2image is a combination of img2img and inpainting. Maybe this could help? :-)
But this means I cannot use the normal v1.5 model and its awesome fine-tuned counterparts, right? I would really prefer to use one of the "standard" models.
And yes, I want to do text-to-video.
> PS: it would be nice if the img2img pipeline took a "latents" input like txt2img does, to make generation more consistent between frames by feeding it the same latent noise.
I think this would be very nice indeed! Would love to review a PR for it :-) Will also assign myself in case I find time (probably not in the next 2 weeks though)
> But this means I cannot use the normal v1.5 model and its awesome fine-tuned counterparts, right? I would really prefer to use one of the "standard" models.
> And yes, I want to do text-to-video.
@nicollegah
Yeah that is my dilemma as well. The hundreds (probably thousands by now?) of community models are all 3-channel no-mask models, so a single latent space input image is all you got. InvokeAI and the Automatic UI can seemingly use the community models for inpainting, but I'm not sure how they manage it.
One other thing you can work with (without switching to inpainting models) is the input prompt. You can take the previous frame as a "style input", like the image variation pipeline does, either using it exclusively or merging it with the text prompt, and maybe that will make the output conform to the previous frame more closely.
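For the "use it exclusively" option, something along these lines might work (untested sketch; the image-variations weights are the Lambda Labs community release, not one of the base SD checkpoints):

```python
# Untested sketch: condition a frame purely on the previous processed frame
# via the image-variation pipeline (no text prompt involved).
import torch
from PIL import Image
from diffusers import StableDiffusionImageVariationPipeline

var_pipe = StableDiffusionImageVariationPipeline.from_pretrained(
    "lambdalabs/sd-image-variations-diffusers", revision="v2.0",
    torch_dtype=torch.float16,
).to("cuda")

prev_frame = Image.open("frame_0001_processed.png").convert("RGB")  # hypothetical frame
out = var_pipe(prev_frame, guidance_scale=3.0, num_inference_steps=30).images[0]
```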
Isn't `StableDiffusionInpaintPipelineLegacy` basically a masked img2img? From what I see, it uses the same approach as Deforum or Automatic's webui: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint_legacy.py#L661
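Rough usage sketch with plain v1.5 weights (untested; file names are placeholders, and as far as I can tell from the mask preprocessing, white areas get repainted and black areas are kept):

```python
# The legacy pipeline loads the ordinary 4-channel v1.5 weights (or any
# fine-tune of them) and pins the masked region to the original latents at
# every denoising step.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipelineLegacy

pipe = StableDiffusionInpaintPipelineLegacy.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

frame = Image.open("frame_0002.png").convert("RGB")  # hypothetical input frame
mask = Image.open("motion_mask.png").convert("L")    # hypothetical per-frame mask

out = pipe(prompt="same prompt every frame", image=frame,
           mask_image=mask, strength=0.5).images[0]
```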
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
> Isn't `StableDiffusionInpaintPipelineLegacy` basically a masked img2img? From what I see, it uses the same approach as Deforum or Automatic's webui: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint_legacy.py#L661
This is exactly what I'm looking for! Thanks bro.
I have a looped img2img pipeline that produces videos, similar to Deforum. For that, I would love some objects in the scene to remain unchanged. The way to do this is with a mask. With Hugging Face components, the only way I am aware of to achieve this is the inpainting models; however, that is a different thing. I want a certain part of the image to remain unchanged while the normal img2img pipeline acts on the rest.
There is a callback function in the img2img pipeline, and my guess is that it could be used to achieve what I am asking for, but I am unsure how.
The Deforum script somehow does it (I think here: https://github.com/HelixNGC7293/DeforumStableDiffusionLocal/blob/17bf2f9b07167f8958065fa1b322ff83029d95b2/deforum-stable-diffusion/helpers/generate.py#L268, although there are several similar lines in the repo).
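If I read that line and the legacy inpaint pipeline correctly, the core trick is to re-noise the original latents to the current timestep at every denoising step and paste them back over the region that should stay fixed. A toy sketch of just that mechanic (dummy tensors; the actual UNet/scheduler.step() update is elided):

```python
# Toy sketch: keep the masked region anchored to the (appropriately noised)
# original latents while the sampler works on the rest.
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")
scheduler.set_timesteps(20)

init_latents_orig = torch.randn(1, 4, 64, 64)  # latents of the source frame
latents = init_latents_orig.clone()            # working latents the sampler would update
noise = torch.randn_like(init_latents_orig)    # fixed noise used only for re-noising
mask = torch.zeros(1, 1, 64, 64)
mask[..., :32] = 1.0                           # 1 = keep original, 0 = let img2img act

for t in scheduler.timesteps:
    # ... real pipeline: UNet prediction + scheduler.step() updates `latents` here ...
    init_latents_proper = scheduler.add_noise(init_latents_orig, noise, torch.tensor([t]))
    latents = init_latents_proper * mask + latents * (1 - mask)
```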
Thanks for your amazing work.